=for comment
This document is in Perl Plain Old Documentation (POD) format.
Converters are available from POD to many text formats.  Ask me for details.

=head1 Resource Locking with Semaphore Files

=head2 Sean M. Burke

"When is it for?" -- Brian Eno and Peter Schmidt, I<Oblique Strategies>

PRODUCTION: indent the above epigraph

The worst kinds of bugs are the ones that don't appear during development,
but then randomly appear only in real use.  In the case of a complicated
program running on several different platforms, such problems are not too
surprising; but the first time I ran into such a problem was in a very
simple program that ran on the same machine I developed it on.  It was a
simple SSI counter for a web page, and it looked like this:

  open COUNTER, "<counter.dat" or die "Can't read-open: $!";
  $hits = <COUNTER>;
  close(COUNTER);
  ++$hits;
  print "Hits on this page: $hits\n";
  open COUNTER, ">counter.dat" or die "Can't write-open: $!";
  print COUNTER $hits;
  close(COUNTER);

I got it going and it all seemed to work fine:

  % perl -cw counter.pl
  counter.pl syntax OK
  % echo 0 > counter.dat ; chmod a+rw counter.dat
  % perl -w counter.pl
  Hits on this page: 1
  % perl -w counter.pl
  Hits on this page: 2
  % perl -w counter.pl
  Hits on this page: 3

I tested it in an F<.shtml> web page and in the browser it merrily
displayed "Hits on this page: 4", then on reloading displayed "Hits on
this page: 5", and so on.

When the web page was put on a public site, it dutifully started reporting
"Hits on this page: 249", and I'd check back later and see "Hits on this
page: 634", and everything seemed fine.  But then I'd look back later and
see "Hits on this page: 45".  Something was clearly amiss, but I could see
absolutely nothing wrong with the tiny counter program.  So I sought the
advice of others, and they pointed out to me the problem that I will now
explain to you.

We as programmers are used to putting ourselves in the shoes of our
program, and relating to it as an individual: "What file should I open
now?  What do I do if I can't open that file?  What do I do if that other
program went and deleted that file?" and so on.  But this handy metaphor
of ours breaks down where we need to imagine other I<instances> of our
program following the same set of instructions.  And that's just how the
above counter program was getting into trouble.  In testing, I never had
two instances of the program running at once; but once the counter was on
a publicly visible web page, there eventually did get to be two instances
of the counter running at once, with various unfortunate results.

=head2 Problems with Simultaneous Instances

Imagine that two people, at about the same instant, are accessing the web
page with the counter discussed above.  This leads the Web server to start
up an instance of F<counter.pl> for each user, at slightly different
times.  Suppose that the content of F<counter.dat> at the beginning is the
number "1000", and let's trace what each instance does.

PRODUCTION: I use indenting to differentiate the two instances, but it
could be reformatted to not be so wide, and/or color/typeface could be
used to differentiate them.

  Instance 1                          Instance 2
  -----------------                   -----------------
  open COUNTER, "<counter.dat"
    or die "Can't read-open: $!";
  $hits = <COUNTER>;
  close(COUNTER);

So instance 1 has read "1000" into C<$hits>.  Then:

                                      open COUNTER, "<counter.dat"
                                        or die "Can't read-open: $!";
                                      $hits = <COUNTER>;
                                      close(COUNTER);

Instance 2 has read "1000" into C<$hits>.  Then:

  ++$hits;
  print "Hits on this page: $hits\n";

                                      ++$hits;
                                      print "Hits on this page: $hits\n";

Each instance increments its C<$hits> and each gets 1001, and each
displays that figure to its respective user.
Then:

  open COUNTER, ">counter.dat"
    or die "Can't write-open: $!";
  print COUNTER $hits;
  close(COUNTER);

Instance 1 has updated F<counter.dat> to 1001, and then ends.  Then
finally:

                                      open COUNTER, ">counter.dat"
                                        or die "Can't write-open: $!";
                                      print COUNTER $hits;
                                      close(COUNTER);

Instance 2 has updated F<counter.dat> to 1001.  The problem is that this
is incorrect: even though we served the page twice, the counter ends up
only 1 hit greater.  That's beside the fact that we just told two
different users that they were both the 1001st viewer of this page,
whereas one was really the 1002nd.

Here's a more drastic case: imagine that the two instances are a bit more
out of phase.  Suppose that instance 1 is writing the value "1501" to
F<counter.dat> just as instance 2 is starting up and reading it:

  Instance 1                          Instance 2
  -----------------                   -----------------
  open COUNTER, ">counter.dat"
    or die "Can't write-open: $!";

                                      open COUNTER, "<counter.dat"
                                        or die "Can't read-open: $!";
                                      $hits = <COUNTER>;
                                      close(COUNTER);

  print COUNTER $hits;
  close(COUNTER);

There, instance 1 overwrites F<counter.dat> (with a zero-length file), but
just as it's about to write the new value of its C<$hits>, instance 2
opens that 0-length file, and reads from it into its own C<$hits>.
Reading from a 0-length file is just like reading from the end of any
file: it returns undef.  Then instance 1 goes and writes "1501" to
F<counter.dat>, and ends.  But instance 2 is still working:

                                      ++$hits;
                                      print "Hits on this page: $hits\n";
                                      open COUNTER, ">counter.dat"
                                        or die "Can't write-open: $!";
                                      print COUNTER $hits;
                                      close(COUNTER);

It has incremented C<$hits>, and incrementing an undef value gives you 1.
It then tells the user "Hits on this page: 1", and now updates
F<counter.dat> with a new value: 1.  Our counter just went from 1501 to 1!

In each of these scenarios, each program was perfectly following its own
instructions, but together they managed to be wrong: in the first case,
each told its user that it was the 1001st hit on this page, and each
updated the F<counter.dat> file with that same figure.  I had tacitly
assumed that this case, where two instances coincide, would never happen;
but I never actually put anything in place to stop it from happening.  Or
maybe I'd assumed it I<could> happen, but that the chances were
astronomical.  And after all, "it's just a stupid web page counter
anyway".

But anything worth doing is worth doing right, and what needed doing here
was some way to make sure that the above scenarios couldn't happen.
Moreover, the way to keep this counter program from losing its count is
also the way we keep more important data from being lost in other
programs: file locking, a Unix OS feature that's meant to help in just
these sorts of cases.

Now a first hack at using file locking would change the program to read
like this:

  use Fcntl ':flock';  # import LOCK_* constants

  open COUNTER, "<counter.dat" or die "Can't read-open: $!";
  flock COUNTER, LOCK_EX;  # So only one instance gets to access this at a time!
  $hits = <COUNTER>;
  close(COUNTER);

  ++$hits;
  print "Hits on this page: $hits\n";

  open COUNTER, ">counter.dat" or die "Can't write-open: $!";
  flock COUNTER, LOCK_EX;  # So only one instance gets to access this at a time!
  print COUNTER $hits;
  close(COUNTER);

PRODUCTION: bold the two "flock COUNTER, LOCK_EX;" lines, and ital the two
"# So only one instance gets to access this at a time!" lines.

So when a given program instance calls C<flock COUNTER, LOCK_EX> on a
given filehandle, it is signaling, via the operating system, that it wants
exclusive access to that file; and if some other process has just called
C<flock COUNTER, LOCK_EX> first, then our instance will wait around until
that other process is done.  And similarly, once we get a lock on this
file, if any other process calls C<flock COUNTER, LOCK_EX>, the OS will
make it wait until we're done.
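If you want to see that waiting behavior for yourself, a tiny test script
makes it visible.  (This sketch is mine, not part of the counter program;
the names F<locktest.pl> and F<locktest.dat> are made up just for this
demonstration.)  Run two copies of it at about the same time, in two
terminals, and the second copy will sit at its C<flock> call until the
first copy closes the file:

  #!/usr/bin/perl -w
  # locktest.pl -- a made-up demo script, not part of the counter program.
  # Run two copies at about the same time (in two terminals): the second
  # copy's "got the lock" message won't appear until the first one is done.
  use strict;
  use Fcntl ':flock';   # import LOCK_* constants

  open LOCKDEMO, ">locktest.dat" or die "Can't write-open locktest.dat: $!";
  flock LOCKDEMO, LOCK_EX;  # blocks here if another copy already has the lock
  print "Process $$ got the lock; holding it for 10 seconds...\n";
  sleep 10;                 # stand-in for doing real work with the file
  close(LOCKDEMO);          # closing the file releases the lock
  print "Process $$ has released the lock.\n";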
The way these programs signal that they're done is by calling C<close> on
the filehandle.  Although they could have called C<flock COUNTER,
LOCK_UN>, it's enough to just close the file, because of these important
facts about locking in the basic Unix file model:

=over

=item *

You can't lock a file until you've already opened it.

=item *

When you close a file, you give up any lock you have on it.

=item *

If a process dies while it has a file open, the file gets closed.

=item *

So the only way a file can be locked at any moment is if a process has
opened it, and then locked it, and hasn't yet closed it (either
explicitly, or by ending).

=back

Unfortunately, this means trouble for our C<flock>-using code.  Notably,
there can still be a problem with instances being out of phase -- since we
can't lock a file without already having opened it, things can still
happen in the brief moment between opening the file and locking it.
Consider when one instance is updating F<counter.dat> just as another new
instance is about to read it:

  Instance 1                          Instance 2
  -----------------                   -----------------
  open COUNTER, ">counter.dat"
    or die "Can't write-open: $!";

                                      open COUNTER, "<counter.dat"
                                        or die "Can't read-open: $!";
                                      $hits = <COUNTER>;
                                      close(COUNTER);

  flock COUNTER, LOCK_EX;

There, the OS dutifully kept the two instances from both holding an
exclusive lock on the file at once.  But the locking is too late, because
instance 1, just by opening the file, has already overwritten
F<counter.dat> with a zero-length file just as instance 2 was about to
read it.  So we're back to the same problem that we had before we had any
C<flock> calls at all: two processes accessing a file that we wish only
one process at a time could access.

=head2 Semaphore Files

There are various special solutions to problems like the above, but the
most general one is semaphore files.  The line of reasoning behind them
goes like this: Since you can't lock a file until you've already opened
it, any content you have in locked files still isn't safe.  So just don't
have any content at all in a locked file.  However, we I<do> have content
we need to protect, namely the data in F<counter.dat>.  But that just
means we can't use that as the file we go locking.  Instead, we'll use
some other file, never with any content of interest, whose only purpose
will be to be a thing that different instances can lock for as long as
they want access to F<counter.dat>.  The file that we lock but never store
anything in, we call a B<semaphore file>.

The way we actually use a semaphore file is by opening it and locking it
before we access some other real resource (like a counter file), and then
not closing the semaphore file until we're done with the real resource.
So we can go back to our original program and make it safe by just adding
code at the beginning to open a semaphore file, and one line at the end to
close it:

  use Fcntl ':flock';  # import LOCK_* constants
  open SEM, ">counter.sem" or die "Can't write-open counter.sem: $!";
  flock SEM, LOCK_EX;

  open COUNTER, "<counter.dat" or die "Can't read-open counter.dat: $!";
  $hits = <COUNTER>;
  close(COUNTER);
  ++$hits;
  print "Hits on this page: $hits\n";
  open COUNTER, ">counter.dat" or die "Can't write-open counter.dat: $!";
  print COUNTER $hits;
  close(COUNTER);

  close(SEM);

This avoids all the problems we saw earlier: since the above program
doesn't do anything with F<counter.dat> until it has an exclusive lock on
F<counter.sem>, and doesn't give up that lock until it's done, there is
only ever one instance of the above program accessing F<counter.dat> at a
time.  It can still happen that some other program alters F<counter.dat>
without first locking F<counter.sem> -- so don't do that!  As long as
every process locks the appropriate semaphore file while it's working on a
given resource, all is well.
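For instance, suppose some other program -- a purely hypothetical
F<report_hits.pl>, not something from the counter setup above -- only ever
I<reads> the count, say to print a daily report.  Even though it never
writes to F<counter.dat>, it should still lock F<counter.sem> first, so
that it can't catch F<counter.dat> in a half-rewritten (or zero-length)
state.  A sketch of such a well-behaved reader:

  #!/usr/bin/perl -w
  # report_hits.pl -- a hypothetical read-only user of counter.dat, shown
  # only to illustrate that every process touching the counter should lock
  # the same semaphore file first.
  use strict;
  use Fcntl ':flock';   # import LOCK_* constants

  # Take the lock on the semaphore file before touching counter.dat...
  open SEM, ">counter.sem" or die "Can't write-open counter.sem: $!";
  flock SEM, LOCK_EX;

  # ...now it's safe to read the real resource.
  open COUNTER, "<counter.dat" or die "Can't read-open counter.dat: $!";
  my $hits = <COUNTER>;
  close(COUNTER);

  close(SEM);   # done with counter.dat, so give up the lock

  print "The counter currently stands at $hits\n";

The exclusive C<LOCK_EX> lock here is simply the easiest thing that works:
it makes readers and the counter program strictly take turns.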
All that you need to do is settle on some correspondence between file(s)
and the semaphore file that controls access to them.  It's a purely
arbitrary choice; but when naming a semaphore file for a resource
F<thing.dat>, I tend to name the semaphore file F<thing.dat.sem>,
F<thing.sem>, or the like.  As with any arbitrary decision, I advise
picking one style and sticking with it -- clearly the whole purpose of
this is defeated if one program looks to F<thing.dat.sem> as the semaphore
file, while another looks to F<thing.sem>.

=head2 Semaphore Objects

With our simple counter program, our simplistic but effective approach was
to just bracket our program with this code:

  use Fcntl ':flock';  # import LOCK_* constants
  open SEM, ">counter.sem" or die "Can't write-open counter.sem: $!";
  flock SEM, LOCK_EX;

  ...do things...

  close(SEM);

  ...do anything else that doesn't involve counter.sem...

That works quite well when our program is simple and involves just one
semaphore file -- all we need to do is C<close(SEM)> once we're done with
F<counter.dat> or whatever resource the SEM filehandle denotes a lock for.
However, when a given program involves a lot of different files, each of
which requires its own semaphore file, and which are being locked and
unlocked in arbitrary orders, then you can't just have them all in one
global filehandle called "SEM".  You can use lexical filehandles, using
the Perl 5.6 C<open my $fh, ...> syntax, as here:

  {
    use Fcntl ':flock';  # import LOCK_* constants
    open my $sem, ">dodad.sem" or die "Can't write-open dodad.sem: $!";
    flock $sem, LOCK_EX;

    ...things dealing with the resource that dodad.sem denotes a lock on...

    close($sem);
  }

In fact, the C<close> command there isn't strictly necessary -- assuming
you haven't copied the object from C<$sem> into any other variable in
memory, then when the program hits the end of the block where C<$sem> was
declared, Perl will delete that variable's value from memory; and then,
seeing that that was the only copy of that filehandle object, it will
implicitly close the file, releasing the lock.  The benefit of using
C<my>'d filehandles instead of globals is that it avoids namespace
collisions: you could have other C<$sem> variables defined in other scopes
in this program, and they wouldn't interfere with this one.

But creating each semaphore object would still require the same repetitive
C<open> and C<flock> calls, and needless repetition is no friend of
programmers.  We might as well wrap it up in a function:

  sub sem {
    my $filespec = shift(@_) || die "What filespec?";
    open my $fh, ">", $filespec
      or die "Can't open semaphore file $filespec: $!";
    chmod 0666, $filespec;  # assuming you want it a+rw
    use Fcntl 'LOCK_EX';
    flock $fh, LOCK_EX;
    return $fh;
  }

And then whenever you want a semaphore lock on a file, you need only call:

  my $sem = sem('/wherever/locks/thing.sem');

All you would then do with that object in C<$sem> is keep it around as
long as you need the lock on that semaphore file; or you could explicitly
release the lock with just a C<close($sem)>.  If you were an OOP fan, you
could even wrap this up in a proper class, an object of which denotes an
exclusive lock on a given semaphore file.
A minimal class would look like this:

  package Sem;

  sub new {
    my $class = shift(@_);
    use Carp ();
    my $filespec = shift(@_) || Carp::croak("What filespec?");
    open my $fh, ">", $filespec
      or Carp::croak("Can't open semaphore file $filespec: $!");
    chmod 0666, $filespec;  # assuming you want it a+rw
    use Fcntl 'LOCK_EX';
    flock $fh, LOCK_EX;
    return bless {'fh' => $fh}, ref($class) || $class;
  }

  sub unlock {
    close(delete $_[0]{'fh'} or return);
    return 1;
  }

  1;  # End of module

Then you need only create the proper semaphore objects like so:

  use Sem;
  my $sem = Sem->new('/wherever/locks/thing.sem');
  ...later...
  $sem->unlock;

=head2 Conclusion

If you've got a data file that's only ever manipulated by one program, and
you're sure you'll never run multiple simultaneous instances of that
program, then you don't need semaphore files.  But you do need semaphore
files in all other cases -- whenever you have a file or other resource
that is accessed by potentially simultaneous processes (whether different
programs, or instances of the same program), and that resource could
suffer from uncontrolled simultaneous access.

In this article, I've assumed that the programs that you need semaphore
files for are all running on the same machine, that that machine runs Unix
(or something with the same basic locking semantics), and that the
filesystem you're putting the semaphore files on isn't NFS (which often
doesn't implement locking properly).  In my next article, I'll discuss
what to do if you need semaphore files, but either you're not under Unix,
or the processes you're needing to coordinate are running on several
different machines.

__END__

Sean M. Burke lives in New Mexico, where he mostly does data-munging for
Native language preservation projects.