I’ve been a little paranoid recently about doing various tasks at work. Even on little things: we’re changing domain-name registrars, and in the changeover from one registrar from another we lost some domain-name records. So for a short time, some of our domains were down, and I had to scramble to recreate the DNS records on the new registrar. It took a (mercifully) short time for the DNS changes to propagate, so the total downtime was probably only an hour. And thankfully we were switching registrars only for our relatively low-traffic domains, as a test for the big registrar switches that we’ll be doing sometime later. At least we now know what to worry about the next time we make a switch. In particular, we’re going to want the registrar switch to go into effect at around 4:00 in the morning on a Sunday, if such a thing is possible, so that one of us can wake up and change the DNS entries when no one is visiting the sites.
It’s a new experience for me to be working with highly mission-critical services that a lot of people are using. I’ve helped switch clients over from one mail service to another, or one host to another, and that involves a fair bit of stress. But the stuff I’ve had to do for work has been of an entirely different order.
Example #2 of this comes from a little script I wrote with my boss a couple weeks ago. The goal is to get log rotation for our Apache logs, à la logrotate. Logrotate is nice, but for various reasons (none of which I can remember now) we couldn’t use it. (I wish I could remember. It would make the story somewhat more interesting.)
So we had to roll our own script. The main goal was to lose absolutely no log entries from our high-volume websites. First of all, this means running the script at 4:30 a.m. Sunday morning, when the amount of traffic is at its lowest. The first useful approach we considered was to cut the oldest website hits — those that were two weeks old or older — out of the current access log and paste them into a new file, then delete them from the current access log. Since we’d be cutting the oldest entries, we wouldn’t (so the thought went) interrupt Apache: Apache writes new entries to the end of the file, while we’d be cutting from the beginning. In outline form, we’d be doing this:
head -n $NUMLINES currentlog > oldlog sed -i ‘1,${NUMLINES}d’ currentlog
That didn’t work, however. When sed is invoked with the ‘-i’ (“edit in place”) argument, it opens the file for both reading and writing. Since Apache also has the file open for writing, it freaks out: it needs to have exclusive write access to the file. After it fails to get a lock on the file, it stops writing to that file. That is, from the moment we invoke sed -i, Apache can’t write to the current access log; we would need to restart Apache (via apachectl graceful). So this approach doesn’t work. (As a sidebar: I wish there were some way to get finer-grained read-write access to a file. By all rights, Apache shouldn’t freak out about someone else trying to write to the same file; I know that I’ll be writing to different lines than Apache will be, but Apache doesn’t know that. And in a flat text file, there is no way to specify record locking; one can only lock entire files. For record locking, we’d want to use something like a database, and for a lot of reasons it occurs to me now that maybe a database would be the best way to structure large log files. Huh. Maybe I’ll propose this at work.)
Fortunately I was paranoid about getting this to work right, so I was very careful about testing this approach on a non-production logfile before deploying it on our server. I used a test server, created a test HTML file, and used ApacheBench to hammer the server with a large number of hits. Then I deployed a trial version of the script on the test server, which cut any hits out of the access log that were 5 minutes old or older. When I ran this test script, I noticed that the access log stopped writing. So that was safe, and I fixed the problem before deploying it to the new server.
Approach number 2, in response to the exclusive-lock problem, was to
- Copy the oldest lines from the current access log to a new file (this step doesn’t require us to get a lock on the current access log);
- Copy the current access log to a temporary file;
- Delete the oldest lines from that temporary file; and finally
- Copy that temporary file back over the current logfile.
We’ll lose some Apache hits between steps 3 and 4. In particular, we’ll lose whatever hits may have come into the current access log between the deletion and the copy. But with any luck, if we run this early in the morning on Sunday, we won’t lose terribly many hits. And this seemed to be the best we could do.
We tried this out on a low-volume website, and got a series of little bugs that I ironed out throughout the week. By the end of the workweek, we felt comfortable going ahead and doing the log rotation on our main website’s files. It ran today at 4:30 a.m.
. . . And so I just discovered an additional problem that comes with scale: since the logfile for our main website is gigantic, all the temp files and so forth take up a large amount of space. Our current log is somewhere on the order of 2 gigs, and even since the beginning of today it’s added something like 50 megs. When the script created temp files, it rapidly consumed all the disk space on that partition. Had the sed -i worked, I don’t believe we would have had this problem, since the file would have been edited in place and would only have consumed a little bit of scratch space. But alas, we did consume all the disk space, and the script failed. Schematically, here’s how the script looked overall:
for each LOGFILE in SETOF_LOGFILES; do TMPFILE = file with some unique name NEWFILE = file with a unique name based on LOGFILE’s name copy oldest lines from LOGFILE to NEWFILE copy LOGFILE to TMPFILE delete oldest lines from TMPFILE copy TMPFILE over LOGFILE done once above loop has finished, compress all old-log-entry files using gzip move all gzipped files into ARCHIVE directory for future post-processing
I think part of the problem is that while the loop is running, we’ve got a lot of temp files laying around, taking up a great deal of space. What we need to do is gzip each of them as they appear, then move them elsewhere immediately.
The lesson in all of this is that it’s really hard to think of all the problems that will arise when scaling from a test environment to a production one. Time for me to go off and fix this up.