As Alex points out, the clusters are finally coming. I’ve updated the outage schedule to reflect the first round of services being moved over. Everything in this round should be transparent: it’ll just be DNS changes, so you probably won’t even notice it happening.
ftp-staging will be down for an hour on Friday
We will be moving the ftp-staging server physically from one colo to the other (since it’s the only machine we have with enough disk space) at or shortly after 10:00am on Friday. If all goes well, it’ll be up and running again at the new location within 15 minutes, but we’re allowing a window of about an hour in case of any unforeseen circumstances. This could potentially disrupt automated nightly build uploads from tinderboxes.
You can continue to watch http://nagios.mozilla.org/outages/ for up-to-the-minute information about the moves.
Multiple outages planned June 20-24
The Mozilla Foundation has had much of its server infrastructure hosted by Meer.net since we left AOL. Meer.net has set up a new colocation facility a few miles away from the one we were originally using, which is a much nicer facility with better access control and server racks that are much easier to work with. We’ve been in a “transition” phase for a few months, with some servers hosted in each facility, moving a few things here and there to the new facility as time and opportunity allow. We’re now down to the final servers to be moved. These have kept getting put off because they’re all the end-user-facing servers running the various websites and developer services, and taking those down to move them means impacting end-user services.
But even those have to be moved sometime, and we’ll be attempting to move them during the week of June 20 to 24. During that week, there will be sporadic outages of various services, anywhere from a few minutes to a few hours each, as various services or servers get moved.
The following public-facing services will be affected:
Bonsai
Bugzilla
Despot
Hendrix
LXR
Tinderbox
CVS
CVS-mirror
CVS-www
IRC
mozilla.org email and mailing lists, mozillafoundation.org email
www.mozilla.org (2 of the 3 servers)
www.bugzilla.org
developer.mozilla.org (“Devmo”) (not developer-test)
wiki.mozilla.org
reporter.mozilla.org
planet.mozilla.org
ftp-staging
primary DNS service for almost every domain we own
Talkback services
We have lots of new servers. The plan is to rack a bunch of the new servers in the new facility and start moving services over by staging them on the new machines, then switching DNS at the point the new machines are ready to take over. This can be done with minimal downtime on almost all of the above services except for Talkback, which will probably need to be shut down and moved in the original servers, because as far as I know, it’s a pain in the butt to set up, and nobody wants to do it all over again 🙂
Individual outages will be announced on nagios (I’ll put a link in the middle of the main page) somewhere between a couple hours and a day in advance of each outage.
Fun with proxy servers
This last week saw the Firefox 1.0.4 security update fire drill. When the exploit in question was leaked, it turned out to be exploiting the default extension install whitelist, which included the addons.mozilla.org site by default. So we decided to redirect all traffic for that site to another domain outside the whitelist in order to short-circuit the exploit. We also did some fiddling with the MIME types on the FTP servers so clicking a link to an extension on the addons site would trigger a download of the extension file instead of automatically installing it.
Once Firefox 1.0.4 was out, with the security holes fixed, we could undo all of that and put the site back how it was. With one exception… the security hole still affects users of Firefox 1.0.3 and older. So we now sniff the UserAgent and redirect anyone using 1.0.3 or older to a page on www.mozilla.org telling them they need to upgrade. Yes, we know UserAgents can be spoofed. We also figure that the people who are enough of a poweruser to spoof their UserAgent are probably enough of a poweruser to know they need to upgrade on their own, and this still blocks the default case of your mom running Firefox unaltered.
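To make the UserAgent check concrete, here is a minimal Python sketch of the decision it has to make. This is not the actual proxy configuration: the function names and the regular expression are made up for illustration, and the only facts taken from above are that 1.0.4 is the fixed release and that anything older gets redirected.

```python
import re

# Hypothetical pattern and names, for illustration only.
FIREFOX_UA = re.compile(r"Firefox/(\d+(?:\.\d+)*)")
FIXED_VERSION = (1, 0, 4)  # first release with the fix, per the post


def parse_version(text):
    """Turn '1.0.3' into (1, 0, 3) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def needs_upgrade_redirect(user_agent):
    """Return True when the UA claims to be a Firefox older than 1.0.4."""
    match = FIREFOX_UA.search(user_agent)
    if match is None:
        return False  # no Firefox token in the UA (or it was spoofed away); let it through
    version = parse_version(match.group(1))
    # Pad short versions so '1.0' is treated as (1, 0, 0).
    version += (0,) * (len(FIXED_VERSION) - len(version))
    return version < FIXED_VERSION


old = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.7) Gecko/20050414 Firefox/1.0.3"
new = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.8) Gecko/20050511 Firefox/1.0.4"
print(needs_upgrade_redirect(old), needs_upgrade_redirect(new))
```

Comparing versions as tuples of integers rather than as strings avoids the usual trap where "1.0.10" sorts before "1.0.4".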
Squid (which we’ve been using for our proxy servers for the addons site) can’t do redirects based on a UserAgent. So we built an RPM of Apache 2.1.3 for RHEL 4, and installed that on two of the new servers, using mod_proxy and mod_cache, and got lots of help from Paul Querna (a developer on the Apache httpd project) setting it up.
I must say, the new proxy and caching features in Apache are pretty freaking sweet. You get a heck of a lot more control over the way the content is proxied, can have multiple backend servers split up by subdirectory under the same domain name, can even serve content locally in addition to proxying. You can efficiently issue 302 and 301 redirects from the proxy server itself instead of having to have hundreds of threads from a rewrite engine running in the background or having to pass them through to the back-end server. Combining the power of mod_rewrite with the power of mod_proxy and mod_cache is a beauty to behold. My initial reaction to all of these features was “gee, it must have a performance cost compared to squid”, but it seems to be keeping up with our traffic just fine so far.
One Hour of Terror
“One hour of terror”: this is how we’ve jokingly started referring to the first hour of every month (measured in GMT), because of the bug in the 1.0 version of Firefox which causes it to only check for updates between the first of the month and the first Sunday of the month. Firefox checks with itself once per hour to see whether it’s been long enough since the last time it checked the server for updates to check again. So any version 1.0 of Firefox that happens to be running at midnight GMT is going to have that check fire within the first hour of the clock ticking over past midnight. This absolutely SLAMs our servers, with every known copy of Firefox 1.0 (shame on people for not upgrading) checking in during that first hour instead of being spread across the entire day as they usually are.
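A rough Python sketch of that window shows why everything piles up at once. It only models the behaviour as described here, not the actual Firefox code; the function names are invented, and treating the first Sunday as part of the window is an assumption.

```python
from datetime import date, timedelta


def first_sunday(year, month):
    """Date of the first Sunday in the given month."""
    first = date(year, month, 1)
    # date.weekday(): Monday is 0, Sunday is 6
    return first + timedelta(days=(6 - first.weekday()) % 7)


def in_check_window(day):
    """Assumed rule: updates are only checked from the 1st through the first Sunday."""
    return day <= first_sunday(day.year, day.month)


# Every running 1.0 client is outside the window on the last day of the month
# and inside it the moment the date rolls over, so the next hourly self-check
# on each of them hits the server within the same hour.
print(in_check_window(date(2005, 4, 30)), in_check_window(date(2005, 5, 1)))
```

Under this model, the only thing spreading requests out is where each client happens to be in its hourly timer, which is why the whole population lands inside a single hour.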
Today is May 1st. Last night, at midnight GMT, was that hour. See the bandwidth graphs. We’d been hoping to have our new hardware set up already by now, but it just arrived this last week, and we haven’t had time to configure it yet. Sky (which has been handling the application update service all by itself until now) took a beating right at midnight (17:00 on the charts I linked above). Within a few minutes, I managed to clone the webserver configuration onto Star (a machine about to be deployed for use by the Talkback services, but which the Talkback folks haven’t actually set anything up on yet) and added Star to the rotation so the requests were split between Sky and Star.
The scary part? When you look at those graphs, you’ll notice Star’s bandwidth skyrocket when it was added to the rotation, but Sky’s bandwidth didn’t go down at all. This means Star didn’t ease any load off of Sky at all; it just picked up load that hadn’t been making it through to begin with. Ooof.
We served over a million requests during that first hour (850,000 on Sky and 200,000 on Star), and about half a million during the second hour (split close to evenly). And the graphs make it plain to see that we weren’t serving all the requests that were coming in.
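For a sense of scale, here is the back-of-the-envelope arithmetic on those first-hour numbers as a few lines of Python. The counts are the rounded figures above, and the variable names are just for illustration.

```python
# Rounded first-hour request counts from the post.
first_hour = {"sky": 850_000, "star": 200_000}

total = sum(first_hour.values())
per_second = total / 3600  # average over the whole hour

print(f"{total:,} requests, about {per_second:.0f} per second on average")
for host, count in first_hour.items():
    print(f"  {host}: about {count / 3600:.0f} per second")
```

That works out to an average of roughly 290 requests per second served. Star only joined the rotation a few minutes into the hour, so its hourly average understates what it was handling once it was live, and neither number counts the requests that never got through.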
Next month we’ll be ready for it.