The original publication of this blog post can be found here.
The software behind Wikimedia’s website for tracking software issues and feature requests was recently updated to a newer version and moved onto a new machine in a different datacenter. Furthermore, proper configuration management for this software was set up. This post explains the technical details and challenges.
Although we are also currently evaluating Wikimedia’s project management tools, we will have to stick with our current infrastructure for a while. Among many other tasks, I spent the last few months preparing the upgrade of Wikimedia’s Bugzilla instance from 4.2 to 4.4. Some reasons for upgrading can be found in this Bugzilla comment.
In late November 2013 I started cleaning up Wikimedia Bugzilla’s custom CSS, which had been copied about five years earlier and not kept in sync. It turned out that 16 out of 22 files could be removed since they did not differ significantly from upstream’s default CSS code (Bugzilla falls back to loading the default CSS file from /skins/default if no custom CSS file is found in /skins/custom). That means less noise and less diffing for future upgrades. In theory.
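As a rough illustration of that cleanup, the following Python sketch lists custom CSS files that are byte-identical to their default counterparts and mimics the fallback lookup described above. The function names are hypothetical, and the check is stricter than the manual review was: files with only insignificant differences would still need to be compared by hand.

```python
import filecmp
from pathlib import Path


def redundant_custom_css(skins_dir):
    """Return names of custom CSS files that are byte-identical to the defaults.

    Assumes the layout described in the text: <skins_dir>/default and
    <skins_dir>/custom. Files that differ only trivially are NOT reported.
    """
    default_dir = Path(skins_dir) / "default"
    custom_dir = Path(skins_dir) / "custom"
    redundant = []
    for custom_file in sorted(custom_dir.glob("*.css")):
        default_file = default_dir / custom_file.name
        # shallow=False forces a comparison of the file contents.
        if default_file.is_file() and filecmp.cmp(
            custom_file, default_file, shallow=False
        ):
            redundant.append(custom_file.name)
    return redundant


def resolve_css(skins_dir, name):
    """Mimic the fallback: use the custom file if it exists, else the default."""
    custom_file = Path(skins_dir) / "custom" / name
    if custom_file.is_file():
        return custom_file
    return Path(skins_dir) / "default" / name
```

Any file reported by redundant_custom_css can be deleted from the custom directory without changing what resolve_css returns in terms of content, which is the property the cleanup relied on.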
After testing these CSS changes on a Wikimedia Labs instance and merging them into our 4.2 production instance, I created numerous patches and put them into Gerrit (Wikimedia’s code review tool) by diffing upstream 4.2 code, upstream 4.4 code and our custom code.
At the same time, Wikimedia’s Technical Operations team wanted to move the Bugzilla server from the kaulen server in our old Tampa datacenter to the zirconium server in our new Ashburn (Eqiad) datacenter. While you’d normally prefer to do only one thing at a time, Daniel Zahn (of Technical Operations) and I decided to create a fresh Bugzilla 4.4 instance from scratch on the new server to see which problems we would run into. During this process Daniel Zahn turned the old setup on kaulen, which was largely manual and had grown organically over the years, into a proper Puppet module. For every “missing module” error we ran into, we avoided installing anything from Perl’s CPAN into Bugzilla’s /lib folder and relied only on distribution packages, which makes for a much cleaner install. Daniel Zahn installed the needed packages by adding them to the Puppet code. While doing this we also removed Bugzilla’s Sitemap extension, as it caused sporadic Search::Sitemap errors when running Bugzilla’s checksetup.pl (plus it’s unmaintained anyway). Furthermore, I ran into another runtime error that needed fixing.
Once all checksetup.pl issues were fixed and Bugzilla was accessible via a web browser, only Bugzilla’s upstream CSS was displayed instead of our custom CSS. Wikimedia’s custom CSS was not offered as an option in the browser, nor could I log into the new Bugzilla (to check which theme was set as default in the admin settings), as the database dump we used for testing predated the creation of my user account.
After Sean Pringle of Technical Operations deployed a more recent Bugzilla database dump, I expected further problems due to upstream changes to CSS loading. I was happy to see that I had been wrong: there were no problems with our custom CSS theming anymore. Instead, I ran into problems with our custom “See Also” field changes: adding and removing such URLs triggered errors, and the URLs themselves were not displayed (though their corresponding “Remove” checkboxes were). Thanks to upstream help in #bugzilla on Mozilla IRC, I finally found out that using Perl’s use base instead of use parent was the culprit.
After creating symlinks to /extensions/WeeklyReport/ to avoid 404 errors for the “Weekly Bug Summary” link in the sidebar (our setup is slightly busted), and after fixing two problems with our cronjobs for whining and data collection, we agreed on a date to copy the database, do some maintenance work and switch the DNS entry. This was announced one week in advance by adding a banner to Bugzilla via its announcehtml parameter.
A few hours before the switch on February 12th 2014, Daniel lowered the time-to-live (TTL) values of the DNS entry of our Bugzilla. When the migration started, I set Bugzilla’s shutdown parameter to make the web UI inaccessible and to make the WebService API return a 503 error for the Bingle script that syncs Bugzilla with Wikimedia’s Mingle instance. It was important to make sure that nobody could write to the old database anymore. We updated the IRC channel topic in #wikimedia-tech to say that Bugzilla was under scheduled maintenance and logged the action in #wikimedia-operations so that it was added to Wikimedia’s Server admin log. All in all we had forgotten only two minor things: our Gerrit integration bot (which adds Gerrit notifications about related patches as comments in Bugzilla) was not able to write and got a 503 error back, so Chad quickly disabled it; and our Nimsoft watchmouse sent an “ALERT! Bugzilla: Service Temporarily Unavailable” message to the Operations mailing list.
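The reason for lowering the TTL hours in advance can be sketched with a little arithmetic: a resolver that cached the record just before the TTL was lowered may keep it for the full old TTL, so the change has to happen at least that long before the switch. The Python snippet below is only an illustration; the function names and all TTL and time values are made up, since the actual values are not recorded here.

```python
from datetime import datetime, timedelta


def latest_time_to_lower_ttl(switch_time, old_ttl_seconds):
    """Latest moment to lower the TTL so that every cache has picked up
    the short TTL by the time the DNS entry is switched."""
    return switch_time - timedelta(seconds=old_ttl_seconds)


def worst_case_stale_until(switch_time, new_ttl_seconds):
    """Latest moment at which a well-behaved resolver may still serve
    the old address after the DNS entry was switched."""
    return switch_time + timedelta(seconds=new_ttl_seconds)


# Hypothetical values for illustration only.
switch = datetime(2014, 2, 12, 22, 0)
print(latest_time_to_lower_ttl(switch, old_ttl_seconds=3600))
print(worst_case_stale_until(switch, new_ttl_seconds=300))
```

With a short TTL in place, the window in which some clients still reach the old server after the switch shrinks from the old TTL to the new one.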
Sean Pringle migrated the old database from db9 in Tampa to a new database on db1001 in Eqiad. After this was done, Daniel Zahn ran checksetup.pl to apply the schema upgrades needed for 4.4.
After 30 minutes of testing to make sure everything worked as expected, we deployed two more custom patches: showing common queries on the frontpage and making saved reports work. During the downtime I also switched off bugmail to make some mass changes without spamming everybody: I merged some version numbers in the “MediaWiki” product to shorten the Version dropdown, and removed the wikibugs-l watcher account from some bug reports. That account is unneeded there because it is set as a global watcher in Bugzilla anyway, and it was a potential issue: a ticket moved to a restricted product like “Security” would still have triggered public bugmail.
A few minutes before the end of the announced downtime of three hours, Daniel switched DNS so that the new Bugzilla on the new server became available to the public. A few hours later, to work around issues for clients not supporting SNI, Daniel changed the order in which Apache loads virtual hosts. This ensures that older clients like Internet Explorer on Microsoft Windows XP will always get to see Bugzilla instead of other miscellaneous web services sharing the same hardware. I had also overlooked a small UI issue that I fixed two days later.
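Why the load order matters can be shown with a simplified model in Python: when a client sends no SNI hostname during the TLS handshake, the server cannot tell which site is wanted and falls back to the first virtual host it loaded. The function and the second host name below are hypothetical, and real Apache matching involves more than this (aliases, certificates, addresses and ports).

```python
def select_vhost(vhosts_in_load_order, sni_name=None):
    """Pick the virtual host for an incoming TLS connection.

    Simplified model: match on the SNI name if the client sent one,
    otherwise fall back to the first virtual host that was loaded.
    """
    if sni_name is not None:
        for name in vhosts_in_load_order:
            if name == sni_name:
                return name
    # No SNI (or no match): the first loaded virtual host is the default.
    return vhosts_in_load_order[0]


# "other-service.example.org" is a made-up placeholder for another
# service on the same machine.
before = ["other-service.example.org", "bugzilla.wikimedia.org"]
after = ["bugzilla.wikimedia.org", "other-service.example.org"]

print(select_vhost(before))  # a client without SNI lands on the wrong site
print(select_vhost(after))   # after reordering it reaches Bugzilla
```

Clients that do send SNI are unaffected by the order, so putting Bugzilla first only changes the outcome for the older clients.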
Now that all is done, the result can be seen on bugzilla.wikimedia.org. All steps to upgrade Wikimedia Bugzilla from 4.2 to 4.4 were documented on a wiki page. You can find all of our custom modifications here.
Andre Klapper, bug wrangler for the Wikimedia Foundation