Multiple outages planned June 20-24

The Mozilla Foundation has had much of its server infrastructure hosted by Meer.net since we left AOL. Meer.net has set up a new colocation facility a few miles away from the one we were originally using; it’s a much nicer facility, with better access control and server racks that are much easier to work with. We’ve been in a “transition” phase for a few months, with some servers hosted in each facility, moving a few things here and there to the new facility as time and opportunity allow. We’re now down to the final servers to be moved, which we’ve been putting off because they’re the end-user-facing servers running the various websites and developer services, and taking them down to move them means impacting end-user services.

But even those have to be moved sometime, and we’ll be attempting to move them during the week of June 20 to 24. During that week, there will be sporadic outages of various services, anywhere from a few minutes to a few hours each, as various services or servers get moved.

The following public-facing services will be affected:

Bonsai
Bugzilla
Despot
Hendrix
LXR
Tinderbox
CVS
CVS-mirror
CVS-www
IRC
mozilla.org email and mailing lists, mozillafoundation.org email
www.mozilla.org (2 of the 3 servers)
www.bugzilla.org
developer.mozilla.org (“Devmo”) (not developer-test)
wiki.mozilla.org
reporter.mozilla.org
planet.mozilla.org
ftp-staging
primary DNS service for almost every domain we own
Talkback services

We have lots of new servers. The plan is to rack a bunch of them in the new facility and start moving services over by staging them on the new machines, then switching DNS once the new machines are ready to take over. This can be done with minimal downtime for almost all of the above services. The exception is Talkback, which will probably need to be shut down and moved along with its original servers, because as far as I know it’s a pain in the butt to set up, and nobody wants to do it all over again 🙂

Individual outages will be announced on nagios (I’ll put a link in the middle of the main page) somewhere between a couple of hours and a day in advance of each outage.

Fun with proxy servers

This last week saw the Firefox 1.0.4 security update fire drill. When the exploit in question was leaked and we noticed that it took advantage of the default extension-install whitelist (which includes the addons.mozilla.org site by default), we decided to redirect all traffic for that site to another domain outside the whitelist in order to short-circuit the exploit. We also did some fiddling with the MIME types on the FTP servers so that clicking a link to an extension on the addons site would trigger a download of the extension file instead of automatically installing it.
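For the curious, the MIME-type part boils down to overriding the content type the download servers send for .xpi files: Firefox keys the automatic extension install off the application/x-xpinstall type, so serving the files as a generic binary type makes the browser offer a download instead. Assuming an Apache-served download box, a minimal sketch looks something like this (the exact directives on our servers may differ):

```apache
# Hand .xpi files out as a plain download instead of
# application/x-xpinstall, so Firefox saves the file rather
# than launching the extension install dialog.
<FilesMatch "\.xpi$">
    ForceType application/octet-stream
</FilesMatch>
```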

Once Firefox 1.0.4 was out, with the security holes fixed, we could undo all of that and put the site back the way it was. With one exception… the security hole still affects users of Firefox 1.0.3 and older. So we now sniff the User-Agent and redirect anyone using 1.0.3 or older to a page on www.mozilla.org telling them they need to upgrade. Yes, we know User-Agent strings can be spoofed. We also figure that anyone who’s enough of a power user to spoof their User-Agent is probably enough of a power user to know they need to upgrade on their own, and this still blocks the default case of your mom running Firefox unaltered.

Squid (which we’ve been using for the addons site’s proxy servers) can’t do redirects based on the User-Agent. So we built an RPM of Apache 2.1.3 for RHEL 4, installed it on two of the new servers with mod_proxy and mod_cache, and got lots of help from Paul Querna (a developer on the Apache httpd project) setting it up.
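The sniffing itself is nothing exotic; it amounts to a mod_rewrite rule keyed off the User-Agent header, something along these lines (the regex is simplified, and the upgrade-page URL below is a placeholder, not the actual page we send people to):

```apache
RewriteEngine On
# Firefox 1.0 through 1.0.3 end their User-Agent string with
# "Firefox/1.0" through "Firefox/1.0.3"; bounce them to an upgrade notice.
# (The target path below is a placeholder for the real upgrade page.)
RewriteCond %{HTTP_USER_AGENT} Firefox/1\.0(\.[123])?$
RewriteRule .* http://www.mozilla.org/firefox-upgrade.html [R=302,L]
```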

I must say, the new proxy and caching features in Apache are pretty freaking sweet. You get a heck of a lot more control over how the content is proxied: you can split a single domain name across multiple backend servers by subdirectory, and you can even serve some content locally in addition to proxying. You can also efficiently issue 301 and 302 redirects from the proxy server itself, instead of keeping hundreds of threads from a rewrite engine running in the background or passing those requests through to the back-end server. Combining the power of mod_rewrite with the power of mod_proxy and mod_cache is a beauty to behold. My initial reaction to all of these features was “gee, this must have a performance cost compared to Squid,” but it seems to be keeping up with our traffic just fine so far.
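To give a flavor of what that looks like in practice, here’s a stripped-down sketch of a reverse-proxy virtual host along those lines. The hostnames, paths, and backend split are invented for illustration, and our real config has a lot more in it:

```apache
<VirtualHost *:80>
    ServerName addons.example.org

    # Cache proxied responses on local disk (mod_cache + mod_disk_cache).
    CacheEnable disk /
    CacheRoot /var/cache/apache-proxy

    # Exclusions and specific paths go before the catch-all mapping:
    # serve /static/ straight off the proxy box instead of proxying it.
    ProxyPass /static/ !
    Alias /static/ /var/www/static/

    # Split the site across backends by subdirectory, under one domain name.
    ProxyPass        /extensions/ http://backend1.example.org/extensions/
    ProxyPassReverse /extensions/ http://backend1.example.org/extensions/

    # Everything else goes to the main backend.
    ProxyPass        / http://backend2.example.org/
    ProxyPassReverse / http://backend2.example.org/

    # Redirects are answered by the proxy itself rather than being
    # passed through to a backend.
    Redirect permanent /old-page.html http://addons.example.org/new-page.html
</VirtualHost>
```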

One Hour of Terror

“One hour of terror” is how we’ve jokingly started referring to the first hour of every month (measured in GMT), because of the bug in Firefox 1.0 that causes it to only check for updates between the first of the month and the first Sunday of the month. Firefox checks once per hour to see whether enough time has passed since it last asked the server for updates, and any copy of Firefox 1.0 that happens to be running at midnight GMT will have that check fire within the first hour after the clock ticks over past midnight. The result is that our servers get absolutely SLAMmed, with every known copy of Firefox 1.0 (shame on people for not upgrading) checking in during that first hour instead of being spread across the entire day as usual.

Today is May 1st. Last night at midnight GMT was that hour. See the bandwidth graphs. We’d been hoping to have our new hardware set up by now, but it only arrived this past week, and we haven’t had time to configure it yet. Sky (which has been handling the application update service all by itself until now) took a beating right at midnight (17:00 on the charts linked above). Within a few minutes, I managed to clone the webserver configuration onto Star (a machine about to be deployed for the Talkback services, but which the Talkback folks haven’t actually set anything up on yet) and added Star to the rotation, so the requests were split between Sky and Star.

The scary part? When you look at those graphs, you’ll notice Star’s bandwidth skyrocket when it was added to the rotation, but Sky’s bandwidth didn’t go down at all. That means Star didn’t take any load off of Sky; it just picked up load that hadn’t been making it through to begin with. Ooof.

We served over a million requests during that first hour (850,000 on Sky and 200,000 on Star), and about half a million during the second hour (split close to evenly). The graphs make it plain that we weren’t serving all the requests that were coming in.

Next month we’ll be ready for it.

Mozilla Foundation hiring a System Administrator

Anyone who’s been around on IRC knows that life as a system administrator at the Mozilla Foundation is pretty darn busy. It’s definitely more than a one-person job, and now we can finally do something about it 🙂 Think you’ve got what it takes to be part of the Mozilla sysadmin team? I could use the help! 🙂 Here’s the job posting.

Changing the domain for mozilla.org mailing lists

One of the projects we’ve been discussing for a while is moving the mozilla.org mailing lists to a separate domain name, such as lists.mozilla.org. The primary intent is to be able to apply stricter anti-spam controls to the lists than we use on the personal and role addresses, since the moderation queues are constantly filled with spam messages that the list moderators can’t keep up with.

It occurred to me the other day that the upcoming news hierarchy change, and the corresponding list-name changes going with it, would be the perfect opportunity to make this switch. Since many of the list names will be changing anyway, we can change the domain name at the same time without any additional disruption.

Polvi is looking into it to see if we can be ready to do that when the time comes to throw the switch.