Wikimedia blog

News from the Wikimedia Foundation and about the Wikimedia movement

Posts Tagged ‘Tampa’

How the Technical Operations team stops problems in their tracks

Last week, you read about how Wikimedia Foundation’s Technical Operations team (“Ops”) spent hundreds or thousands of staff hours to refactor and automate all the services it provides, to prepare for the January data center migration. One reward from that work: our sites were not down as often, and when they were, downtime was for better reasons.

“Another thing that illustrates our growth and maturity is our downtime,” says Operations engineer Peter Youngmeister. “Something that’s less visible to people outside of Ops is the kind of downtime we have. For example, we no longer have much downtime of the variety of ‘Oops, bumped that cable’ or ‘That one box died,’ because things are much more robust now, much more redundant. A lot of that is a product of the massive automation push we’ve been going through, which lets us create redundancy far more easily, and lets us spend our time not fighting fires.”

Wikimedia Foundation engineer Roan Kattouw adds: “Or, ‘the master DB server has a full disk’ — that one happened a few times a few years ago, and doesn’t happen any more now.”

To fix crises fast, we need monitoring: tools that automatically check for problems and alert our engineers when something is broken. In the very early days of our sites, we simply trusted that there would usually be a sysadmin online and available in case someone noticed a problem and complained on IRC. Several years ago, we began to use Nagios for monitoring and assigned a “pager duty” rotation to decide who might be woken up by a crisis.

Nagios runs coarse automated tests on the behavior of our site (such as “Does port 80 return an HTTP 301?”) and checks certain key numbers to make sure they’re within the desired range (for instance, to test whether we’re running out of memory). If a test fails, Nagios sends out email, IRC, and SMS alarms.

Monitoring helps us address the crisis faster, but it often doesn’t help with the actual problem-solving.

“Nagios is great for telling you when things are broken, and crap for telling you why,” Peter explains. “The work that Asher Feldman has done creating profiling data is more useful.”

Monitoring our servers (here in Ashburn, Virginia) helps to minimize outages and services disruptions.

Monitoring our servers (here in Ashburn, Virginia) helps to minimize outages and services disruptions.

As Roan puts it: “Profiling is the act of generating data on ‘How much time does large task X spend doing small subtask Y?’ The reason for that is that 1) one of those small Ys might actually be not so small, and be a problem, and 2) per the 80-20 rule, for some Ys, optimization will have a larger impact, so you wanna find those.” Profiling generates knowledge about the behavior of our systems, so that engineers can better understand how the cluster should be operating, and offers data points for troubleshooting.

We use two profiling systems to get time-series performance data: Ganglia at the “host” level, and Graphite at the “application” level (get a Labs login to see Graphite). In the past two years, we’ve configured Ganglia to cover much more data, and in 2012 began to use Graphite. The better data makes it more useful for troubleshooting, and Director of Operations CT Woo regularly checks the dashboard to look out for upcoming problems and alert his team. This reduces downtime.

For example, on one ganglia page, we previously only had access to host data: free disk, load, etc. We have recently added the Apache-specific data, such as requests per second and number of idle threads. This additional information aids sysadmins in troubleshooting. “One can look at it and make better deductions than just ‘Yup, server’s under a lot of load…’,” explains Peter.

Like puppetization, improvements in profiling were an investment by the Ops team. “There’s a plug-in for Ganglia that does Apache performance stats. It took me a couple of hours to set it all up. But, again, that’s being forward-thinking, debt that we had to work off instead of just cursing ourselves when it wasn’t there when we needed it. It’s a massive undertaking to decide to do things The Right Way, set up a platform, instead of doing a million one-offs.”

While puppetizing and improving monitoring and profiling to prepare for the data center migration, the Operations team had to defer other non-urgent work. “Ops was less able to give support to many teams,” says Peter. “For example, Fundraising just had a couple of boxes and could do whatever they wanted on them, as opposed to now where [Operations Engineer] Jeff Green is working on making an awesome, PCI-compliant system with them full time. Or, Analytics was very independent/unsupported, because there were so little human-hours to give to supporting things that weren’t just keeping the site up… I think that the EQIAD [Virginia Data center] build-out is very demonstrative of the amount of [technical] debt that Ops was in.”

Now, Peter is looking forward to seeing Wikimedia “spin up more data centers dramatically more quickly.” The Operations team is making preparations for an additional data center on North America’s west coast. Site Architect Asher Feldman sees a “continuing arc of refinement” in the team’s future, rather than “challenges that end, to be replaced by new ones.” “The challenges of making MediaWiki scale aren’t going to go away any time soon; nor will the need for incremental architecture modernization at multiple levels.” For instance, Ops needs to continue puppetizing certain services; some modules also need their Puppet manifests tweaked so that they work not just on the main site, but also in Wikimedia Labs.

You can check out the Operation’s team 2012–2013 goals to find out more about what’s next (including improvements in search and security).

Sumana Harihareswara, Engineering Community Manager

From duct tape to puppets: How a new data center became an opportunity to do things right

Last week, the Wikimedia Foundation flipped a historic switch: we transitioned our main technical services to a shiny new data center in Ashburn, Virginia. For the first time since 2004, Wikimedia sites are no longer primarily hosted in Tampa, Florida.

Peter Youngmeister works in the Wikimedia Foundation's Technical Operations team.

Peter Youngmeister works in the Wikimedia Foundation’s Technical Operations team.

To help understand this grueling journey (and why it’s crucial), look through the eyes of Wikimedia Foundation engineer Peter Youngmeister. Peter joined the Wikimedia Foundation’s Technical Operations team (“Ops”) about two years ago, in March 2011. At the time, “the team” meant “about six engineers supporting the fifth-most visited site on the Web,” said Peter. The Foundation has now increased its Ops team to 14, and has several job openings.

“This also meant that out of the fast/cheap/well triangle, we’d gone with fast and cheap,” Peter recalled. We made quick-and-dirty solutions because problems had to be solved immediately. “With so few Ops engineers, you’re always playing catchup; long-term is hard.” He said that the digital infrastructure when he arrived was “kinda like many many layers of really artfully applied duct tape.”

And the biggest, most pressing flaw: Wikimedia only had one fully functional primary data center, in Tampa, Florida. If something catastrophic happened to Tampa, all the sites would go down until new servers could be brought online and data recovered from backup. So the Ops team chose a new data center location, in Ashburn, Virginia, and started preparing to integrate it into our infrastructure. But the preparation of EQIAD, which began in 2011, turned out to require much more work than the Operations and Platform engineering teams had foreseen.

We had never set up a data center of this complexity from scratch before. The systems in Tampa were “layers of duct tape that had been built up over years… Our first problem was that, for example, very little was in Puppet,” Peter said. To configure the Wikimedia servers, we use Puppet, a configuration management system, which lets us write code (Puppet “manifests”) that manages all of our servers like a single large application (and more easily track, troubleshoot, and revert changes).

Since the new data center would exactly mirror the old one, leveraging the power of Puppet to keep our configurations in sync would be crucial. But since our infrastructure included dozens of services that weren’t in Puppet yet, we had to examine each of their configurations to “puppetize” them. And in early 2011, Peter noted, “our whole search infrastructure existed outside of Puppet control. Our Puppet manifests for our databases were a file that just had a comment that said ‘domas is a slacker.’”

In short, Wikimedia needed not only to replicate the functionality that had been incrementally added over ten years, but to refactor it into an automatable form so that the third, fourth, etc. replications would be far easier. So, in addition to the Ops team’s day-to-day responsibilities for site maintenance and crisis management, Ops and Platform teams needed to find hundreds or thousands of staff-hours to refactor, automate and add monitoring to all the services it provided. We aren’t done yet with our “mass puppetization” investment, which we’ve been working on for at least two years.

The core application (MediaWiki) is only one of the myriad moving parts that needed attention; over the past two years, we’ve puppetized and strengthened databases, search, fundraising code, logging and analytics tools, caches, the Nagios monitoring software and dozens of other services. Take search as an example: several years ago, the Wikimedia Foundation used one search server to cover nearly all the wikis other than English Wikipedia — a dangerous single point of failure. Peter arrived at the Foundation and found that none of the search infrastructure was puppetized. After he worked significantly on search, as of November 2012, he noted we had “two fully independent search setups, one in each data center. Fail-over takes a couple of minutes at most.”

Puppet Tutorial: Video from the Wikimedia Foundation tech days, September 11, 2012, explaining Puppet configuration management in the context of Wikimedia’s site/services infrastructure. Speaker/slides: Ryan Lane.

Puppetizing the configuration files, and using Gerrit to manage code review and approval also gave us better transparency and helped staff and volunteers collaborate better on improvements, maintenance and troubleshooting. Anyone can see how our servers are configured, read the Puppet configuration “manifests,” propose new changes and view and comment on pending proposals.

In contrast, “when I got here, everything was done on a local Subversion repository or our puppetmaster, and then pushed out from there, which kinda works if you have 6 or fewer people,” Peter said. (The Puppetmaster is the master repository that instructs all the other boxes in the cluster to update their manifests, and thus updates their packages and configurations.) To keep track of configuration changes, people simply used an IRC bot to log summaries of their actions to the server admin log, which made it hard to revert changes or help train new teammates. “But also, when the Ops team is only 6 people, and everyone has been around for years, everyone just knows all the parts,” he explained.

As they created the 700+ hostclasses currently defined in Puppet, Operations engineers moved towards treating our infrastructure as a codebase, and thus from pure systems administration towards a DevOps approach. As of November 2012, “we’re very nearly at a point where we can manage our whole infrastructure without needing to log into hosts, which is the whole goal,” Peter said with a smile. Logging into hosts is a bad thing “because it means that you’re doing things by hand and/or that what you’re doing isn’t going through code review. Moving to Gerrit for our Puppet repos is awesome: It means I can really easily see what my coworkers are doing. I can ask for review when needed. It’s a huge sign of maturation of our department.”

Their years of work have led to a nearly painless data center migration, but it also began paying off immediately with reduced downtime. You’ll read more about that in the second part of this story next week.

Sumana Harihareswara, Engineering Community Manager

Wikimedia sites to move to primary data center in Ashburn, Virginia

(Update on January 22nd, 2013, 20:00 (UTC): Our Operations team considers the migration to be over. Major disruption is no longer expected.)

Close-up on Wikimedia Foundation Servers

All Wikimedia sites, including Wikipedia, may encounter temporary interruptions on January 22–24, as they transition to servers in a new data center in Ashburn, Virginia (see more photos).

Next week, the Wikimedia Foundation will transition its main technical operations to a new data center in Ashburn, Virginia, USA. This is intended to improve the technical performance and reliability of all Wikimedia sites, including Wikipedia.

Engineering teams have been preparing for the migration to minimize inconvenience to our users, but major service disruption is still expected during the transition. Our sites will be in read-only mode for some time, and may be intermittently inaccessible. Users are advised to be patient during those interruptions, and share information in case of continued outage or loss of functionality.

The current target windows for the migration are January 22nd, 23rd and 24th, 2013, from 17:00 to 01:00 UTC (see other timezones on

Wikimedia sites have been hosted in our main data center in Tampa, Florida, since 2004; before that, the couple of servers powering Wikipedia were in San Diego, California. Ashburn is the third and newest primary data center to host Wikimedia sites.

A major reason for choosing Tampa, Florida as the location of the primary data center in 2004 was its proximity to founder Jimmy Wales’ home, at a time when he was much more involved in the technical operations of the site. In 2009, the Wikimedia Foundation’s Technical Operations team started to look for other locations with better network connectivity and more clement weather. Located in the Washington, D.C. metropolitan area, Ashburn offers faster and more reliable connectivity than Tampa, and usually fewer hurricanes.

The Operations team started to plan and prepare for the Virginia data center in Summer 2010. The actual build-out and racking of servers at the colocation facility started in February 2011, and was followed by a long period of hardware, system and software configuration. Traffic started to be served to users from the Ashburn data center in November 2011, in the form of CSS and JavaScript assets (served from ““).

We reached a major milestone in February 2012, when caching servers were set up to handle read-only requests for Wikipedia and Wikimedia content, which represent most of the traffic to Wikipedia and its sister sites. In April 2012, the Ashburn data center also started to serve media files (from ““).

Cacheable requests represent about 90 percent of our traffic, leaving 10 percent that requires interaction with our web (Apache) and database (MySQL) servers, which are still being hosted in Tampa. Until now, every edit made to a Wikipedia page has been handled by the servers in Tampa. This dependency on our Tampa data center was responsible for the site outage in August 2012, when a fiber cut severed the connection between our two locations.

Starting next week, the new servers in Ashburn will take on that role as well, and all our sites will be able to function fully without relying on the servers in Florida. The legacy data center in Tampa will continue to be maintained, and will serve as a secondary “hot failover” data center: servers will be in standby mode to take over, should the primary site experience an outage. Server configuration and data will be synchronized between the two locations to ensure a transition as smooth as possible in case of technical difficulties in Ashburn.

Besides just installing newer hardware, setting up the data center in Ashburn has also been an opportunity for architecture overhauls, like incremental improvements of the text storage system, and the move to an entirely new media storage system to keep up with the growth of the content generated and curated by our contributors.

Wikimedia’s technical infrastructure aims to be as open and collaborative as the sites it powers. Most of the configuration of our servers is publicly accessible, and the Wikimedia Labs initiative allows contributors to test and submit improvements to the sites’ configuration files.

The Wikimedia Foundation currently operates a total of about 885 servers, and serves about 20 billion page views a month, on a non-profit budget that relies almost entirely on donations from readers.

Guillaume Paumier
Technical Communications Manager