The Technical Operations team has just completed behind-the-scenes work that will likely never be noticed by our readers.

Our External Storage databases hold the text for every version of every wiki page; they have slowly grown over the life of Wikipedia and its sister projects. Ten years is a lifetime on the Internet, and the incremental changes that were made to our external storage system over that period, though appropriate at the time they were made, resulted in a setup that was a challenge to maintain and which was becoming unreliable.

Graph of query durationWe spent a few weeks analyzing all the various servers across which the page text data was spread, in order to gather it all together onto a single host. From there, it could be replicated onto newer, more reliable and higher performance hardware. Along the way, we found and fixed a number of inconsistencies to make the dataset more regular.

The deployment of the new hardware lasted a few days (as we moved things piece by piece) and was finished this past Monday with no fanfare. There was a brief (about 10 minute) period during which articles were unable to be edited while we switched writes to the new hardware. The end result is a barrage of small improvements, all of which together make for a happy TechOps team:

  • average query duration has dropped from about 15ms to around 8ms and the worst case from 576ms down to 60ms;
  • replication and failover processes are now well known and standardized;
  • total hardware used has dropped from around 30 servers to 8, now in two locations;
  • hosts no longer double up as web servers and database servers for text; dedicated servers are used for the database.

It’s a small victory in the battle against entropy, but an important prerequisite for carrying out our mission of providing unfettered and reliable access to the sum of all knowledge.

Ben Hartshorne, Operations Engineer