I recently did up a diagram of how our Bugzilla site was set up, mostly for the benefit of other sysadmins trying to find the various pieces of it. Several folks expressed interest in sharing it with the community just to show an example of how we were set up. So I cleaned it up a little, and here it is:
At first glance it looks somewhat excessive just for a Bugzilla, but the Mozilla Project lives and dies by the content of this site. All work pretty much stops if it doesn’t work, so it’s one of our highest-priority sites to keep operating at all times for developer support. The actual hardware required to run the site at full capacity for the number of users we get hitting it is a little less than half of what’s shown in the diagram.
We have the entire site set up in two different datacenters (SJC1 is our San Jose datacenter, PHX1 is our Phoenix datacenter). Thanks to the load balancers taking care of the cross-datacenter connections for the master databases, it’s actually possible to run it from both sites concurrently to split the load. But because of the amount of traffic Bugzilla does to the master databases, and the latency in connection setup over that distance, it’s a little bit slow from whichever datacenter isn’t currently hosting the master, so we’ve been trying to keep DNS pointed at just one of them to keep it speedy.
This still works great as a hot failover, though, which got tested in action this last Sunday when we had a system board failure on the master database server in Phoenix. Failing the entire site over to San Jose took only minutes, and the tech from HP showed up to swap the system board 4 hours later. The fun part was that I had only finished setting up this hot failover setup about a week prior, so the timing couldn’t have been any better for that system board failure. If it had happened any sooner we might have been down for a long time waiting for the server to get fixed.
When everything is operational, we’re trying to keep it primarily hosted in Phoenix. As you can see in the diagram, the database servers in Phoenix are using solid-state disks for the database storage. The speed improvement on large queries compared to traditional spinning disks is just amazing. I haven’t done any actual timing to get hard facts on that, but the difference is large enough that you can easily notice it just from using the site.
I should note that flash storage for MySQL in Phoenix makes Bugzilla -fast-.
What sort of traffic does Bugzilla get to warrant this sort of setup, in terms of hits/sec or queries/sec or whichever metric is appropriate?
I don’t really have a feel for how big BMO is.
What’s so special about China that it is treated separately?
@tom jones: We cache some of the content (css, javascript, and images that are part of the site layout rather than bug content) to help it perform better, since connectivity isn’t always the greatest there. We have a datacenter in Beijing, so it made sense to use it. There wasn’t enough hardware there to fully replicate the site, though.
@Alex: the average amount of traffic that bugzilla.mozilla.org gets could easily be handled on one webserver and one slave database server. However, it tends to be pretty “bursty”.
For example, every so often a major bug will get fixed (or some controversy will arise from one) and a bug will get linked from major news sources.
Another example is triage meetings for our development teams. Because of our distributed nature (last I heard around 40% of our employees are remote, and that doesn’t count a lot of the volunteers that are involved as well), these meetings happen over telephone conference calls to discuss a bug list, and when everyone on the call suddenly loads that same buglist in their browser at once, that’s a lot of query traffic.
These situations don’t happen often, but it was important to us to keep the site responsive when they do happen, so there’s sufficient hardware in place to handle these bursts.
Gerv Markham just posted some stats I dug up the other day for the kind of usage we have, which is over at http://weblogs.mozillazine.org/gerv/archives/2011/05/bugzillamozillaorg_metrics.html
The stats Gerv posted only give you a feel for people who create content in Bugzilla; they don’t count people who only browse. I’m seeing if I can come up with some numbers for that, too.
How big is the database? I’ve always wondered how feasible it would be to have a local copy of the Bugzilla database on my laptop for querying while on a plane. 🙂
OK, so our metrics guys tell me we aren’t actually recording any pageview stats for Bugzilla, but a quick grep of the weblogs tells me we’re getting roughly a million hits per day from about 70,000 unique IP addresses.
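For anyone wanting to pull the same two numbers out of their own web logs, here is a rough sketch of the counting. It assumes each log line starts with the client IP address, as in the common Apache access log layout; the function name and the sample lines are made up for illustration and are not taken from our logs.

```python
def summarize_access_log(lines):
    """Return (total hits, number of unique client IPs) for an access log.

    Assumes the client IP is the first whitespace-separated field on each
    line. Blank lines are skipped.
    """
    hits = 0
    unique_ips = set()
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        hits += 1
        unique_ips.add(fields[0])
    return hits, len(unique_ips)


# Hypothetical sample lines, using documentation-only IP ranges.
sample_log = [
    '192.0.2.10 - - [01/May/2011:10:00:00 -0700] "GET /show_bug.cgi?id=1 HTTP/1.1" 200 5120',
    '192.0.2.10 - - [01/May/2011:10:00:05 -0700] "GET /buglist.cgi HTTP/1.1" 200 20480',
    '198.51.100.7 - - [01/May/2011:10:00:09 -0700] "GET /show_bug.cgi?id=2 HTTP/1.1" 200 4096',
]

hits, unique_ips = summarize_access_log(sample_log)
print(f"{hits} hits from {unique_ips} unique IP addresses")
```

To run it against a real file, pass the open file object instead of the sample list, since iterating over a file yields its lines.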
Awesome information, always wondered what the BMO infrastructure looked like exactly.
With this setup, do PHX transactions need to wait for acknowledgment from the SJC1 master to commit? If so, did you notice delays from it?
@Cameron McCormack: The live data is right about 30 GB at the moment, and the gzipped mysqldump for the daily backups runs about 14 GB.
@Michael Kurze: The masters have blind two-way replication between them. MySQL won’t duplicate transactions because it filters out binlog entries that were generated by its own server ID. You can write to either master and it will immediately take, and this does mean you can potentially have insert ID conflicts. We deal with this by having the ACLs set up so that the load balancers are the only sources allowed to connect to MySQL, and both load balancers are set up to only allow writes to go to one of the masters (and always the same one). We can swap masters at any time by switching which one is active in the load balancers (but this does require taking the site offline long enough to make sure all of the connections to the one you’re moving off of complete before the one you’re moving onto is brought online, to avoid conflicts).
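To make the server ID filtering and the single-writer rule a bit more concrete, here is a toy model in Python. It is only a sketch of the idea, not how MySQL or our load balancers are implemented: the `Master` and `WriteRouter` classes, the server IDs 1 and 2, and the sample SQL statement are all invented for the example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class BinlogEvent:
    origin_server_id: int  # ID of the server where the write first happened
    statement: str


@dataclass
class Master:
    server_id: int
    applied: list = field(default_factory=list)

    def write(self, statement):
        """Client write: apply locally and tag the event with our own ID."""
        self.applied.append(statement)
        return BinlogEvent(self.server_id, statement)

    def receive(self, event):
        """Replication: skip events that originated here, apply the rest."""
        if event.origin_server_id == self.server_id:
            return False  # our own change coming back around the loop
        self.applied.append(event.statement)
        return True


class WriteRouter:
    """Hypothetical stand-in for the load balancer rule: one writable master."""

    def __init__(self, masters, active_id):
        self.masters = {m.server_id: m for m in masters}
        self.active_id = active_id

    def write(self, statement):
        active = self.masters[self.active_id]
        event = active.write(statement)
        # Ship the event to every master, including the originator, to show
        # that the server ID check is what stops it being applied twice.
        for master in self.masters.values():
            master.receive(event)
        return event


phx = Master(server_id=1)
sjc = Master(server_id=2)
router = WriteRouter([phx, sjc], active_id=1)
router.write("UPDATE bugs SET bug_status = 'RESOLVED' WHERE bug_id = 1")
print(phx.applied == sjc.applied, len(phx.applied))
```

Because every write enters through the one active master, both sides end up with the same statements in the same order. Swapping masters in this model would just mean changing `active_id`, which is why the real swap has to wait for in-flight connections to finish first.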
Are you concerned that SSD devices have limited write cycles per cell or you simply don’t care?
Because of the amount of performance improvement we don’t really care… it’s a necessary cost for the performance we need. That said, SSD devices aren’t nearly as write-limited as they once were. Modern SSD drives have a high enough re-write capacity to last almost as long as a platter-style hard drive.
Hi Dave,
I apologize for reviving this 2-year-old topic 🙂 but I’m currently going through an exercise of making my company’s Bugzilla scale and be more fault-tolerant. While I do have experience with both Bugzilla and Web/MySQL services for that matter, I’m still learning a lot from this and would be really grateful if you could clarify several things in your setup for me, like:
1) Your web servers are load-balanced, with Bugzilla, does that mean you have several Bugzillas (web part of it) connecting to the same DB? I know BZ can run like that for the most part (thanks to mid-air collision detection), but what about multiple bug editing – there’s no mid-air collision detection for that? Or it is implemented differently, somehow?
2) With attachments stored on NAS and mounted via NFS to those multiple web servers, have you seen any problems with locking/data integrity (e.g. when several people work on the same bug and modify attachments)?
@Serge:
1) Yes. The mid-air collision detection works exactly the same no matter how many web heads you have – whoever gets to the database first to store the change wins, and everyone after that gets a collision. The attempt to change it is made when the user submits the change, not when they load the bug to view it. This is no different than two people trying to edit at the same time on the same webserver from different browsers.
2) No, since the attachment metadata is still stored in the database, and the same mid-air protections apply (and in our case, we’re actually storing the attachments themselves in the database; the NAS is only used for configuration, chart data, etc.).
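The "first one to the database wins" behaviour from 1) can be sketched roughly like this. It is a simplified illustration and not Bugzilla's actual code: the `BugStore` class, the integer version counter, and the field names are all assumptions made for the example, and a real database would need the check and the update to happen atomically.

```python
class MidAirCollision(Exception):
    """Raised when a bug changed between loading it and submitting an edit."""


class BugStore:
    """Hypothetical shared store standing in for the one master database."""

    def __init__(self):
        self._bugs = {}  # bug_id -> (version, fields)

    def create(self, bug_id, fields):
        self._bugs[bug_id] = (1, dict(fields))

    def load(self, bug_id):
        """Return the version the viewer saw plus a copy of the fields."""
        version, fields = self._bugs[bug_id]
        return version, dict(fields)

    def submit(self, bug_id, seen_version, changes):
        """Apply changes only if nobody else changed the bug in the meantime."""
        current_version, fields = self._bugs[bug_id]
        if seen_version != current_version:
            raise MidAirCollision(
                f"bug {bug_id} changed since version {seen_version}"
            )
        fields.update(changes)
        self._bugs[bug_id] = (current_version + 1, fields)
        return current_version + 1


store = BugStore()
store.create(42, {"status": "NEW"})

# Two users load the same bug through two different web heads.
seen_by_a, _ = store.load(42)
seen_by_b, _ = store.load(42)

store.submit(42, seen_by_a, {"status": "ASSIGNED"})  # first submit wins
try:
    store.submit(42, seen_by_b, {"status": "RESOLVED"})
except MidAirCollision as err:
    print("collision:", err)
```

The check happens at submit time against the shared store, so it makes no difference which web head each user went through.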
Thanks, that clarifies it!