
Migrating Wikimedia Labs to a new Data Center

As part of ongoing efforts to reduce our reliance on our Tampa, Florida data center, we have just moved Wikimedia Labs to EQIAD, the new data center in Ashburn, Virginia. This migration was a multi-month project and involved hard work on the part of dozens of technical volunteers. In addition to reducing our reliance on the Tampa data center, this move should provide quite a few benefits to the users and admins of Wikimedia Labs and Tool Labs.

Migration objectives

We had several objectives for the move:

  1. Upgrade our virtualization infrastructure to use OpenStack Havana;
  2. Minimize project downtime during the move;
  3. Stop relying on nova-network and start using Neutron;
  4. Convert the Labs data storage system from GlusterFS to NFS;
  5. Identify abandoned and disused Labs resources.

Upgrade and Minimize Downtime

Wikimedia Labs uses OpenStack to manage the virtualization back-end. The Tampa Labs installation was running an older version of OpenStack, ‘Folsom’. Folsom is more than a year old now, but OpenStack does not provide an in-place upgrade path that doesn’t require considerable downtime, so we’ve been living with Folsom to avoid disrupting existing Labs services.

Similarly, a raw migration of Labs from one set of servers to another would have required extensive downtime, as simply copying all of the data would be the work of days.

The solution to both 1) and 2) was provided by OpenStack’s multi-region support. We built an up-to-date OpenStack install (version ‘Havana’) in the Ashburn center and then modified our Labs web interface to access both centers at once. To ease the move, Ryan Lane wrote an OpenStack tool that allowed users to authenticate in both data centers simultaneously, and updated the Labs web interface so that both data centers were visible at the same time.
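
For the curious, here is a rough sketch of what the two-region setup looks like from a client’s point of view, using python-novaclient. The credentials, auth URL and region names below are placeholders, not our actual configuration.

```python
# Minimal sketch: listing instances in two OpenStack regions with
# python-novaclient. Credentials, the auth URL and the region names
# are placeholders, not Wikimedia's real settings.
from novaclient import client

AUTH_URL = "https://openstack.example.org:5000/v2.0"

def connect(region):
    # Because both regions share an authentication back-end, the same
    # credentials work in either one; only region_name changes.
    return client.Client("2", "username", "password", "project",
                         AUTH_URL, region_name=region)

for region in ("tampa", "ashburn"):
    nova = connect(region)
    print(region, [server.name for server in nova.servers.list()])
```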

At this point (roughly a month ago), we had two different clouds running: one full and one empty. Because of a shared LDAP back-end, the new cloud already knew about all of our projects and users.

Two clouds, before migration

Then we called on volunteers and project admins for help. In some cases, volunteers built fresh new Labs instances in Ashburn. In other cases, instances were shut down in Tampa and duplicated using a simple copy script run by the Wikimedia Operations team. In either case, project functions were supported in both data centers at once so that services could be switched over quickly and at the convenience of project admins.

Two clouds, during migration

As of today, over 50 projects have been copied to or rebuilt in Ashburn. For those projects with uptime requirements, the outages were generally limited to a few minutes.

Switch to OpenStack Neutron

We currently rely on the ‘nova-network’ service to manage network access between Labs instances. Nova-network is working fine, but OpenStack has introduced a new network service, Neutron, which is intended to replace it. We hoped to adopt Neutron in the Ashburn cloud (largely in order to avoid being stuck using unsupported software), but quickly ran into difficulties. Our use case (flat DHCP with floating IP addresses) is not currently supported in Neutron, and OpenStack designers seem to be wavering in their decision to deprecate nova-network.

After several days of experimentation, expedience won out and we opted to reproduce the same network setup in Ashburn that we were using in Tampa. We may or may not attempt an in-place switch to Neutron in the future, depending on whether or not nova-network continues to receive upstream support.

Switch to NFS storage

Most Labs projects have a shared project-wide volume for storing files and transferring data between instances. In the original Labs setup, these shared volumes used GlusterFS. GlusterFS is easy to administer and designed for use cases similar to ours, but we’ve been plagued with reliability issues: in recent months, the lion’s share of Labs failures and downtime were the result of Gluster problems.

When setting up Tool Labs last year and facing our many issues with GlusterFS, Marc-Andre Pelletier opted to set up a new NFS system to manage shared volumes for the Tool Labs project. This work has paid off with much-improved stability, so we’ve adopted a similar system for all projects in Ashburn.

Again, we largely relied on volunteers and project admins to transfer files between the two systems. Most users were able to copy their data over as needed, scping or rsyncing between Tampa and Ashburn instances. As a hedge against accidental data loss, the old Gluster volumes were also copied over into backup directories in Ashburn using a simple script. The total volume of data copied was around 30 Terabytes; given the many-week migration period, network bandwidth between locations turned out not to be a problem.
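
The copy script itself was nothing exotic; conceptually it boiled down to something like the sketch below (the project names, hostnames and paths are invented for illustration, and this is not the actual script the Operations team ran).

```python
# Illustrative sketch only, not the actual copy script. Project names,
# hostnames and paths are invented for the example.
import subprocess

PROJECTS = ["exampleproject1", "exampleproject2"]

for project in PROJECTS:
    src = "gluster-host.tampa.example:/data/project/%s/" % project
    dst = "/srv/gluster-backup/%s/" % project
    # rsync -a preserves ownership, permissions and timestamps; rerunning
    # the script only transfers files that have changed since the last pass.
    subprocess.run(["rsync", "-a", src, dst], check=True)
```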

Identify and reclaim wasted space

Many Labs projects and instances are set up for temporary experiments, and have a short useful life. The majority of them are cleaned up and deleted after use, but Labs still has a tendency to leak resources as the odd instance is left running without purpose.

We’ve never had a very good system for tracking which projects are or aren’t in current use, so the migration was a good opportunity to clean house. For every project that was actively migrated by staff or volunteers, another project or two simply sat in Tampa, unmentioned and untouched. Some of these projects may yet be useful (or might have users but no administrators), so we need to be very careful about prematurely deleting them.

Projects that were not actively migrated (or noticed, or mentioned) during the migration period have been ‘mothballed’. That means that their storage and VMs were copied to Ashburn, but are left in a shutdown state. These instances will be preserved for several months, pending requests for their revival. Once it’s clear that they’re fully abandoned (in perhaps six months), they will be deleted and the space reused for future projects.

Conclusions

In large part, this migration involved a return to older, more tested technology. I’m still hopeful that in the future Labs will be able to make use of more fundamentally cloud-designed technologies like distributed file shares, Neutron, and (in a perfect world) live instance migration. In the meantime, though, the simple approach of setting up parallel clouds and copying things across has gone quite well.

This migration relied quite heavily on volunteer assistance, and I’ve been quite charmed by how gracious the vast majority of volunteers were about this inconvenience. In many cases, project admins regarded the migration as a positive opportunity to build newer, cleaner projects in Ashburn, and many have expressed high hopes for stability in the new data center. With a bit of luck we’ll prove this optimism justified.

Andrew Bogott, DevOps Engineer

Request for proposals: New datacenter in the continental US

The Wikimedia Foundation’s Technical Operations team is seeking proposals on the provisioning of a new datacenter facility.

After working through the specifics internally, we now have a public RFP posted and ready for proposals. We invite any organization meeting the requirements outlined to submit a proposal for review. Most of the relevant details are in the document itself, but feel free to reach out to me or anyone on the Technical Operations team if you have any questions.

Please feel free to forward this link far and wide. Have colleagues, contacts or friends in the datacenter sector? Then please forward it on! :)

Below are the primary requirements, excerpted from the RFP:

Primary Requirements

  • The data center location must be in the midwestern/western continental US (i.e., Chicago westward).
  • The capacity for at least 32 enclosures initially; expansion possibilities (first right of refusal in contract on adjacent or nearby cage area) for another row of 8.

HTTPS by default beta program

Now that we’ve enabled HTTPS by default for logged-in users, our next major objective is to enable HTTPS by default for anonymous users. We have a number of steps to take to reach that goal, including a couple of important initial ones: conducting proper load testing, and shaking out bugs on smaller-scale deployments before mass deployment.

For both load testing and shaking out bugs, we’d like to switch some Wikimedia projects to HTTPS by default. We’re launching a beta program where you can opt your project in for HTTPS by default testing. Signing up for this program doesn’t necessarily mean your project will be selected in the first rounds of testing, but it will put your project into a pool of wikis we can select from.

To sign up for the program, please get consensus from the contributor community on your project and add it to the list on meta.

Ryan Lane
Operations Engineer, Wikimedia Foundation

HTTPS enabled by default for logged-in users on Wikimedia sites

Today, August 28, the Wikimedia Foundation is making a change to the software that powers the Wikimedia projects: By default, all logged-in users will now be using HTTPS to access Wikimedia sites. What this does is encrypt the connection between the Wikimedia servers and the user’s browser so that the information sent between the two is not readable by anyone else. This is in response to the recent concerns over the privacy and security of our user community, and we explained the rationale for this change in our post about the future of HTTPS at Wikimedia.

What this means for you

How this works is simple: If a user wants to log in, they will be redirected to use HTTPS for the login, thus keeping their username and password secure. After they are logged in, they stay on the HTTPS version of the Wikimedia site they are using.
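
The redirect itself is conceptually very simple. The sketch below illustrates the idea only; it is not MediaWiki’s actual code, and the login path is just an example.

```python
# Sketch of the redirect idea only; not MediaWiki's actual logic.
from urllib.parse import urlsplit, urlunsplit

def login_redirect(url):
    """If a login URL arrives over plain HTTP, answer with a redirect to
    the HTTPS equivalent so credentials are never sent in the clear."""
    parts = urlsplit(url)
    if parts.scheme == "http" and parts.path.startswith("/wiki/Special:UserLogin"):
        secure = urlunsplit(("https",) + tuple(parts[1:]))
        return 302, {"Location": secure}
    return None  # already HTTPS, or not a login request

print(login_redirect("http://en.wikipedia.org/wiki/Special:UserLogin"))
# -> (302, {'Location': 'https://en.wikipedia.org/wiki/Special:UserLogin'})
```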

Excluded Countries

Some users live in areas where HTTPS is not an easy option, most often because of explicit blocking by a government. At the request of these communities, we have made an explicit exclusion for users from those affected countries. Simply put, users from China and Iran will not be required to use HTTPS for logging in, nor for viewing any Wikimedia project site.

Disabling

Are you having a slow or unreliable experience while browsing Wikimedia sites over HTTPS? Then you can turn HTTPS off in your user preferences, under the “User profile” tab: Uncheck “Always use a secure connection when logged in”. You will need to log out and log in again for the preference to take effect. But remember, you will still need to log in using the secure HTTPS process.

HELP!

For further details, please see the HTTPS page on Meta-Wiki, which is available in several languages.

Are you unable to log in and edit a Wikimedia wiki after this change? Please contact the Wikimedia Foundation Operations team via any means you find comfortable, including this blog post’s comments section, on IRC in the #wikimedia-operations channel, or via the https@wikimedia.org email address.

Greg Grossmeier
Release Manager, Wikimedia Foundation

The future of HTTPS on Wikimedia projects

The Wikimedia Foundation believes strongly in protecting the privacy of its readers and editors. Recent leaks of the NSA’s XKeyscore program have prompted our community members to push for the use of HTTPS by default for the Wikimedia projects. Thankfully, this is already a project that was being considered for this year’s official roadmap and it has been on our unofficial roadmap since native HTTPS was enabled.

Our current architecture cannot handle HTTPS by default, but we’ve been incrementally making changes to make it possible. Since we appear to be specifically targeted by XKeyscore, we’ll be speeding up these efforts. Here’s our current internal roadmap:

  1. Redirect to HTTPS for log-in, and keep logged-in users on HTTPS. This change is scheduled to be deployed on August 21, at 16:00 UTC. Update as of 21 August: we have delayed this change and will now deploy it on Wednesday, August 28 at 20:00 UTC/1pm PT.
  2. Expand the HTTPS infrastructure: Move the SSL terminators directly onto the frontend varnish caches, and expand the frontend caching clusters as necessitated by increased load.
  3. Put in engineering effort to more properly distribute our SSL load across the frontend caches. In our current architecture, we’re using a source hashing based load balancer to allow for SSL session resumption. We’ll switch to an SSL terminator that supports a distributed SSL cache, or we’ll add one to our current solution. Doing so will allow us to switch to a weighted round-robin load balancer and will result in a more efficient SSL cache.
  4. Starting with smaller projects, slowly soft-enable HTTPS for anonymous users by default, gradually moving toward soft-enabling it on the larger projects as well. By soft-enable we mean changing our rel=canonical links in the head section of our pages to point to the HTTPS version of pages, rather than the HTTP versions (see the sketch after this list). This will cause search engines to return HTTPS results, rather than HTTP results.
  5. Consider enabling perfect forward secrecy. Enabling perfect forward secrecy is only useful if we also eliminate the threat of traffic analysis of HTTPS, which can be used to detect a user’s browsing activity, even when using HTTPS.
  6. Consider doing a hard-enable of HTTPS. By hard-enable we mean force redirecting users from HTTP pages to the HTTPS versions of those pages. A number of countries, China being the largest example, completely block HTTPS to Wikimedia projects, so doing a hard-enable of HTTPS would probably block large numbers of users from accessing our projects at all. Because of this, we feel this action would probably do more harm than good, but we’ll continue to evaluate our options here.
  7. Consider enabling HTTP Strict Transport Security (HSTS) to protect against SSL-stripping man-in-the-middle attacks. Implementing HSTS could also lead to our projects being inaccessible for large numbers of users as it forces a browser to use HTTPS. If a country blocks HTTPS, then every user in the country that received an HSTS header would effectively be blocked from the projects.
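
To make the “soft-enable” step above more concrete, the sketch below shows the idea behind switching the canonical link. It is an illustration only, not MediaWiki’s implementation.

```python
# Illustration of "soft-enabling" HTTPS (step 4 above); not MediaWiki code.
def canonical_link(page_url, soft_enabled=True):
    """Emit the <link rel="canonical"> tag for a page.

    With soft-enable on, the canonical URL always advertises HTTPS, even
    when the page itself was served over plain HTTP, so search engines
    index and return the HTTPS version of the page."""
    if soft_enabled and page_url.startswith("http://"):
        page_url = "https://" + page_url[len("http://"):]
    return '<link rel="canonical" href="%s"/>' % page_url

print(canonical_link("http://en.wikipedia.org/wiki/HTTPS"))
# -> <link rel="canonical" href="https://en.wikipedia.org/wiki/HTTPS"/>
```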

Currently we don’t have time frames associated with any change other than redirecting logged-in users to HTTPS, but we will be making time frames internally and will update this post at that point.

Until HTTPS is enabled by default, we urge privacy-conscious users to use HTTPS Everywhere or Tor [1].

Ryan Lane
Operations Engineer, Wikimedia Foundation

[1]: There are restrictions with Tor; see Wikipedia’s information on this.

Wikipedia Adopts MariaDB

This past Wednesday marked a milestone in the evolution of Wikimedia’s database infrastructure: the completion of the migration of the English and German Wikipedias, as well as Wikidata, to MariaDB 5.5.

For the last several years, we’ve been operating the Facebook fork of MySQL 5.1 with most of our production environment running a build of r3753. We’ve been pleased with its performance; Facebook’s MySQL team contains some of the finest database engineers in the industry and they’ve done much to advance the open source MySQL ecosystem.

That said, MariaDB’s optimizer enhancements, the feature set of Percona’s XtraDB (much of it overlaps with the Facebook patch, but I particularly like add-ons such as the ability to save the buffer pool LRU list, avoiding costly warmups on new servers), and the improvements in Oracle’s MySQL 5.5 provide compelling reasons to consider upgrading. Equally important, as supporters of the free culture movement, the Wikimedia Foundation strongly prefers free software projects; that includes a preference for projects without bifurcated code bases between differently licensed free and enterprise editions. We welcome and support the MariaDB Foundation as a not-for-profit steward of the free and open MySQL related database community.

Preparing For Change

Major version upgrades of a production database are not to be made lightly. In fact, as late as 2011, some Wikipedia languages were still running a heavily patched version of MySQL 4.0 — the migration to 5.1 required both schema changes and direct modifications of data dumps to alter the padding of binary-typed columns. MySQL 5.5 contains a variety of incompatibilities with prior versions, thanks in part to better compliance with SQL standards. Changes to the query optimizer between versions may also change the execution plan for common queries, sometimes for the better, but historically sometimes not. SQL behavior changes may result in replication breakage or data consistency issues, while performance regressions, whether from query plan or other changes, can cause site outages. This calls for a lot of testing.

Compatibility testing was accomplished by running MariaDB replicas outside of production, watching for replication errors, replaying production read queries and validating results. After identifying and fixing a couple of MediaWiki issues that surfaced as replication errors (along the lines of trying to set unsigned integer types to negative values, which previously caused a wrap-around instead of an error), we replayed production read queries using pt-upgrade from Percona Toolkit. Pt-upgrade replays a query log against two servers, and compares the responses for variances or errors. Scripts originally developed for our recent datacenter migration to simultaneously warm up many standby databases from current production read traffic helped with rough load testing and benchmarking. Along the way, a pair of bugs in MariaDB 5.5.28 and 5.5.29 were identified, one of which was a rare but potentially severe performance regression related to a new query optimizer feature. The MariaDB team was very responsive and quick to offer solutions, complete with test cases.
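
The principle behind pt-upgrade is straightforward: replay the same read queries against both builds and flag any response that differs. A stripped-down sketch of that idea (using pymysql, with placeholder hostnames and credentials; the real tool is pt-upgrade from Percona Toolkit) looks like this:

```python
# Stripped-down illustration of what pt-upgrade does for us, not the tool
# itself. Hostnames and credentials are placeholders.
import pymysql

def fetch(host, sql):
    conn = pymysql.connect(host=host, user="repl_check", password="...",
                           database="enwiki")
    try:
        with conn.cursor() as cur:
            cur.execute(sql)
            return cur.fetchall()
    finally:
        conn.close()

def compare(sql, old_host="db-old.example", new_host="db-new.example"):
    # Run the same query against the old and new builds and report
    # any variance in the result set.
    old, new = fetch(old_host, sql), fetch(new_host, sql)
    if old != new:
        print("MISMATCH for query: %s" % sql)
```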

Performance Testing In Production

As a read-heavy site, Wikipedia aggressively uses edge caching. Approximately 90% of pageviews are served entirely from the edge, while at the application layer we utilize both memcached and redis in addition to MySQL. Despite that, the MySQL databases serving English Wikipedia alone reach a daily peak of ~50k queries/second. Most are read queries served by load-balanced slaves, depending on consistency requirements. 80% of the English Wikipedia query load (up to 40k qps) is typically handled by just two database servers at any given time. Our most common query type (40% of all) has a median execution time of ~0.2ms and a 95th percentile time of ~50ms. To successfully use MariaDB in production, we need it to keep up with the level of performance obtained from Facebook’s MySQL fork, and to behave consistently as traffic patterns change.

Ishmael views of pt-query-digest data collected via tcpdump for the most common Wikipedia read queries (pdf). The first page of a query shows data from db1042, running mysql-facebook-r3753, the second from db1043 over the same time period, running MariaDB 5.5.30.

Once confident that application compatibility issues were solved and comfortable with performance obtained under benchmark conditions, it was time to test in production. One of the production read slaves from the English Wikipedia shard was taken out of rotation, upgraded to MariaDB 5.5.30, and then returned for warmup. The load balancer weight was then gradually increased until it and a server still running MySQL 5.1-facebook-r3753 were equally weighted and receiving most of the query load.

Also from the Percona Toolkit, we use pt-query-digest across all database servers to collect query performance data, which is then stored in a centralized database. Query data is collected from two sources per server and stored in separate buckets — from the slow query log, which only captures queries exceeding 450ms, and from periodic brief sampling of all queries obtained by tcpdump. Ishmael provides a convenient way to visualize and inspect query digest data over time. Using it, along with direct analysis of the raw data, allowed us to validate that every query continued to perform within acceptable bounds.

For our most common query type, 95th percentile times over an 8-hour period dropped from 56ms to 43ms and the average from 15.4ms to 12.7ms. 50th percentile times remained a bit better with the 5.1-facebook build over the sample period, 0.185ms vs. 0.194ms. Many query types were 4-15% faster with MariaDB 5.5.30 under production load, a few were 5% slower, and nothing appeared aberrant beyond those bounds.

From there, we upgraded the remaining slaves one by one, before finally rotating in a newer upgraded class of servers to act as masters. The switch was seamless and performance continues to look good. We’ll be completing the migration of shards covering the rest of our projects over the next month. Beyond that, we’re looking forward to the future release of MariaDB 10 (global transaction IDs!), and are continually assessing ways to improve our data storage infrastructure. If you’re interested in helping, the Wikimedia Foundation is hiring!

Asher Feldman, Site Architect

How the Technical Operations team stops problems in their tracks

Last week, you read about how Wikimedia Foundation’s Technical Operations team (“Ops”) spent hundreds or thousands of staff hours to refactor and automate all the services it provides, to prepare for the January data center migration. One reward from that work: our sites were not down as often, and when they were, downtime was for better reasons.

“Another thing that illustrates our growth and maturity is our downtime,” says Operations engineer Peter Youngmeister. “Something that’s less visible to people outside of Ops is the kind of downtime we have. For example, we no longer have much downtime of the variety of ‘Oops, bumped that cable’ or ‘That one box died,’ because things are much more robust now, much more redundant. A lot of that is a product of the massive automation push we’ve been going through, which lets us create redundancy far more easily, and lets us spend our time not fighting fires.”

Wikimedia Foundation engineer Roan Kattouw adds: “Or, ‘the master DB server has a full disk’ — that one happened a few times a few years ago, and doesn’t happen any more now.”

To fix crises fast, we need monitoring: tools that automatically check for problems and alert our engineers when something is broken. In the very early days of our sites, we simply trusted that there would usually be a sysadmin online and available in case someone noticed a problem and complained on IRC. Several years ago, we began to use Nagios for monitoring and assigned a “pager duty” rotation to decide who might be woken up by a crisis.

Nagios runs coarse automated tests on the behavior of our site (such as “Does port 80 return an HTTP 301?”) and checks certain key numbers to make sure they’re within the desired range (for instance, to test whether we’re running out of memory). If a test fails, Nagios sends out email, IRC, and SMS alarms.
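
As a rough illustration of how coarse these checks are, the following sketch mimics the “does port 80 return an HTTP 301?” test in the way a Nagios plugin would (exit code 0 for OK, 2 for CRITICAL). It is not one of our actual plugins.

```python
#!/usr/bin/env python
# Sketch of a coarse Nagios-style check, not one of our actual plugins:
# exit 0 (OK) if port 80 answers with the expected HTTP 301, otherwise
# exit 2 (CRITICAL) so the monitoring server raises an alarm.
import sys
import http.client

OK, CRITICAL = 0, 2

def check_redirect(host, expected=301):
    conn = http.client.HTTPConnection(host, 80, timeout=10)
    try:
        conn.request("GET", "/")
        status = conn.getresponse().status
    except OSError:
        print("CRITICAL: %s did not answer on port 80" % host)
        return CRITICAL
    finally:
        conn.close()
    if status == expected:
        print("OK: %s returned HTTP %d" % (host, status))
        return OK
    print("CRITICAL: %s returned HTTP %d, expected %d" % (host, status, expected))
    return CRITICAL

if __name__ == "__main__":
    sys.exit(check_redirect("en.wikipedia.org"))
```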

Monitoring helps us address the crisis faster, but it often doesn’t help with the actual problem-solving.

“Nagios is great for telling you when things are broken, and crap for telling you why,” Peter explains. “The work that Asher Feldman has done creating profiling data is more useful.”

Monitoring our servers (here in Ashburn, Virginia) helps to minimize outages and service disruptions.

As Roan puts it: “Profiling is the act of generating data on ‘How much time does large task X spend doing small subtask Y?’ The reason for that is that 1) one of those small Ys might actually be not so small, and be a problem, and 2) per the 80-20 rule, for some Ys, optimization will have a larger impact, so you wanna find those.” Profiling generates knowledge about the behavior of our systems, so that engineers can better understand how the cluster should be operating, and offers data points for troubleshooting.
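
To make the idea concrete, here is a toy version of that kind of per-subtask timing; our real profiling pipeline is considerably more involved than this.

```python
# Toy per-subtask profiler illustrating the idea only; the production
# profiling pipeline is far more involved than this.
import time
from collections import defaultdict

subtask_time = defaultdict(float)   # cumulative seconds per subtask name

def profiled(name):
    """Decorator: add the wall-clock time of each call to subtask_time."""
    def wrap(fn):
        def inner(*args, **kwargs):
            start = time.time()
            try:
                return fn(*args, **kwargs)
            finally:
                subtask_time[name] += time.time() - start
        return inner
    return wrap

@profiled("parse_page")
def parse_page(text):
    return text.split()   # stand-in for the real subtask

parse_page("some wikitext")
print(dict(subtask_time))   # e.g. {'parse_page': 1.2e-05}
```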

We use two profiling systems to get time-series performance data: Ganglia at the “host” level, and Graphite at the “application” level (get a Labs login to see Graphite). In the past two years, we’ve configured Ganglia to cover much more data, and in 2012 began to use Graphite. The richer data makes these tools more useful for troubleshooting, and Director of Operations CT Woo regularly checks the dashboards to spot emerging problems and alert his team. This reduces downtime.
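
Feeding a number into Graphite is deliberately simple. The sketch below uses Graphite’s plaintext protocol; the hostname is a placeholder rather than our actual endpoint.

```python
# Sending one datapoint with Graphite's plaintext protocol:
# "<metric path> <value> <unix timestamp>\n" over TCP to port 2003.
# The hostname below is a placeholder, not our actual endpoint.
import socket
import time

def send_metric(path, value, host="graphite.example.org", port=2003):
    line = "%s %s %d\n" % (path, value, int(time.time()))
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))

# Example: send_metric("apache.requests_per_second", 1234)
```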

For example, on one Ganglia page, we previously only had access to host data: free disk, load, etc. We have recently added Apache-specific data, such as requests per second and number of idle threads. This additional information aids sysadmins in troubleshooting. “One can look at it and make better deductions than just ‘Yup, server’s under a lot of load…’,” explains Peter.

Like puppetization, improvements in profiling were an investment by the Ops team. “There’s a plug-in for Ganglia that does Apache performance stats. It took me a couple of hours to set it all up. But, again, that’s being forward-thinking, debt that we had to work off instead of just cursing ourselves when it wasn’t there when we needed it. It’s a massive undertaking to decide to do things The Right Way, set up a platform, instead of doing a million one-offs.”

While puppetizing and improving monitoring and profiling to prepare for the data center migration, the Operations team had to defer other non-urgent work. “Ops was less able to give support to many teams,” says Peter. “For example, Fundraising just had a couple of boxes and could do whatever they wanted on them, as opposed to now where [Operations Engineer] Jeff Green is working on making an awesome, PCI-compliant system with them full time. Or, Analytics was very independent/unsupported, because there were so little human-hours to give to supporting things that weren’t just keeping the site up… I think that the EQIAD [Virginia Data center] build-out is very demonstrative of the amount of [technical] debt that Ops was in.”

Now, Peter is looking forward to seeing Wikimedia “spin up more data centers dramatically more quickly.” The Operations team is making preparations for an additional data center on North America’s west coast. Site Architect Asher Feldman sees a “continuing arc of refinement” in the team’s future, rather than “challenges that end, to be replaced by new ones.” “The challenges of making MediaWiki scale aren’t going to go away any time soon; nor will the need for incremental architecture modernization at multiple levels.” For instance, Ops needs to continue puppetizing certain services; some modules also need their Puppet manifests tweaked so that they work not just on the main site, but also in Wikimedia Labs.

You can check out the Operations team’s 2012–2013 goals to find out more about what’s next (including improvements in search and security).

Sumana Harihareswara, Engineering Community Manager

From duct tape to puppets: How a new data center became an opportunity to do things right

Last week, the Wikimedia Foundation flipped a historic switch: we transitioned our main technical services to a shiny new data center in Ashburn, Virginia. For the first time since 2004, Wikimedia sites are no longer primarily hosted in Tampa, Florida.

Peter Youngmeister works in the Wikimedia Foundation’s Technical Operations team.

To help understand this grueling journey (and why it’s crucial), look through the eyes of Wikimedia Foundation engineer Peter Youngmeister. Peter joined the Wikimedia Foundation’s Technical Operations team (“Ops”) about two years ago, in March 2011. At the time, “the team” meant “about six engineers supporting the fifth-most visited site on the Web,” said Peter. The Foundation has now increased its Ops team to 14, and has several job openings.

“This also meant that out of the fast/cheap/well triangle, we’d gone with fast and cheap,” Peter recalled. We made quick-and-dirty solutions because problems had to be solved immediately. “With so few Ops engineers, you’re always playing catchup; long-term is hard.” He said that the digital infrastructure when he arrived was “kinda like many many layers of really artfully applied duct tape.”

And the biggest, most pressing flaw: Wikimedia only had one fully functional primary data center, in Tampa, Florida. If something catastrophic happened to Tampa, all the sites would go down until new servers could be brought online and data recovered from backup. So the Ops team chose a new data center location, in Ashburn, Virginia, and started preparing to integrate it into our infrastructure. But the preparation of EQIAD, which began in 2011, turned out to require much more work than the Operations and Platform engineering teams had foreseen.

We had never set up a data center of this complexity from scratch before. The systems in Tampa were “layers of duct tape that had been built up over years… Our first problem was that, for example, very little was in Puppet,” Peter said. To configure the Wikimedia servers, we use Puppet, a configuration management system, which lets us write code (Puppet “manifests”) that manages all of our servers like a single large application (and more easily track, troubleshoot, and revert changes).

Since the new data center would exactly mirror the old one, leveraging the power of Puppet to keep our configurations in sync would be crucial. But since our infrastructure included dozens of services that weren’t in Puppet yet, we had to examine each of their configurations to “puppetize” them. And in early 2011, Peter noted, “our whole search infrastructure existed outside of Puppet control. Our Puppet manifests for our databases were a file that just had a comment that said ‘domas is a slacker.’”

In short, Wikimedia needed not only to replicate the functionality that had been incrementally added over ten years, but to refactor it into an automatable form so that the third, fourth, etc. replications would be far easier. So, in addition to the Ops team’s day-to-day responsibilities for site maintenance and crisis management, the Ops and Platform teams needed to find hundreds or thousands of staff-hours to refactor, automate and add monitoring to all the services they provided. We aren’t done yet with our "mass puppetization" investment, which we’ve been working on for at least two years.

The core application (MediaWiki) is only one of the myriad moving parts that needed attention; over the past two years, we’ve puppetized and strengthened databases, search, fundraising code, logging and analytics tools, caches, the Nagios monitoring software and dozens of other services. Take search as an example: several years ago, the Wikimedia Foundation used one search server to cover nearly all the wikis other than English Wikipedia — a dangerous single point of failure. Peter arrived at the Foundation and found that none of the search infrastructure was puppetized. After he worked significantly on search, as of November 2012, he noted we had “two fully independent search setups, one in each data center. Fail-over takes a couple of minutes at most.”

Puppet Tutorial: Video from the Wikimedia Foundation tech days, September 11, 2012, explaining Puppet configuration management in the context of Wikimedia’s site/services infrastructure. Speaker/slides: Ryan Lane.

Puppetizing the configuration files and using Gerrit to manage code review and approval also gave us better transparency and helped staff and volunteers collaborate better on improvements, maintenance and troubleshooting. Anyone can see how our servers are configured, read the Puppet configuration “manifests,” propose new changes and view and comment on pending proposals.

In contrast, “when I got here, everything was done on a local Subversion repository or our puppetmaster, and then pushed out from there, which kinda works if you have 6 or fewer people,” Peter said. (The Puppetmaster is the master repository that instructs all the other boxes in the cluster to update their manifests, and thus updates their packages and configurations.) To keep track of configuration changes, people simply used an IRC bot to log summaries of their actions to the server admin log, which made it hard to revert changes or help train new teammates. “But also, when the Ops team is only 6 people, and everyone has been around for years, everyone just knows all the parts,” he explained.

As they created the 700+ hostclasses currently defined in Puppet, Operations engineers moved towards treating our infrastructure as a codebase, and thus from pure systems administration towards a DevOps approach. As of November 2012, “we’re very nearly at a point where we can manage our whole infrastructure without needing to log into hosts, which is the whole goal,” Peter said with a smile. Logging into hosts is a bad thing “because it means that you’re doing things by hand and/or that what you’re doing isn’t going through code review. Moving to Gerrit for our Puppet repos is awesome: It means I can really easily see what my coworkers are doing. I can ask for review when needed. It’s a huge sign of maturation of our department.”

Their years of work have led to a nearly painless data center migration, and the effort also began paying off immediately with reduced downtime. You’ll read more about that in the second part of this story next week.

Sumana Harihareswara, Engineering Community Manager

Wikimedia sites to move to primary data center in Ashburn, Virginia

(Update on January 22nd, 2013, 20:00 (UTC): Our Operations team considers the migration to be over. Major disruption is no longer expected.)

Close-up on Wikimedia Foundation Servers

All Wikimedia sites, including Wikipedia, may encounter temporary interruptions on January 22–24, as they transition to servers in a new data center in Ashburn, Virginia (see more photos).

Next week, the Wikimedia Foundation will transition its main technical operations to a new data center in Ashburn, Virginia, USA. This is intended to improve the technical performance and reliability of all Wikimedia sites, including Wikipedia.

Engineering teams have been preparing for the migration to minimize inconvenience to our users, but major service disruption is still expected during the transition. Our sites will be in read-only mode for some time, and may be intermittently inaccessible. Users are advised to be patient during those interruptions, and share information in case of continued outage or loss of functionality.

The current target windows for the migration are January 22nd, 23rd and 24th, 2013, from 17:00 to 01:00 UTC (see other timezones on timeanddate.com).

Wikimedia sites have been hosted in our main data center in Tampa, Florida, since 2004; before that, the couple of servers powering Wikipedia were in San Diego, California. Ashburn is the third and newest primary data center to host Wikimedia sites.

A major reason for choosing Tampa, Florida as the location of the primary data center in 2004 was its proximity to founder Jimmy Wales’ home, at a time when he was much more involved in the technical operations of the site. In 2009, the Wikimedia Foundation’s Technical Operations team started to look for other locations with better network connectivity and more clement weather. Located in the Washington, D.C. metropolitan area, Ashburn offers faster and more reliable connectivity than Tampa, and usually fewer hurricanes.

The Operations team started to plan and prepare for the Virginia data center in Summer 2010. The actual build-out and racking of servers at the colocation facility started in February 2011, and was followed by a long period of hardware, system and software configuration. Traffic started to be served to users from the Ashburn data center in November 2011, in the form of CSS and JavaScript assets (served from “bits.wikimedia.org”).

We reached a major milestone in February 2012, when caching servers were set up to handle read-only requests for Wikipedia and Wikimedia content, which represent most of the traffic to Wikipedia and its sister sites. In April 2012, the Ashburn data center also started to serve media files (from “upload.wikimedia.org”).

Cacheable requests represent about 90 percent of our traffic, leaving 10 percent that requires interaction with our web (Apache) and database (MySQL) servers, which are still being hosted in Tampa. Until now, every edit made to a Wikipedia page has been handled by the servers in Tampa. This dependency on our Tampa data center was responsible for the site outage in August 2012, when a fiber cut severed the connection between our two locations.

Starting next week, the new servers in Ashburn will take on that role as well, and all our sites will be able to function fully without relying on the servers in Florida. The legacy data center in Tampa will continue to be maintained, and will serve as a secondary “hot failover” data center: servers will be in standby mode to take over, should the primary site experience an outage. Server configuration and data will be synchronized between the two locations to ensure a transition as smooth as possible in case of technical difficulties in Ashburn.

Besides just installing newer hardware, setting up the data center in Ashburn has also been an opportunity for architecture overhauls, like incremental improvements of the text storage system, and the move to an entirely new media storage system to keep up with the growth of the content generated and curated by our contributors.

Wikimedia’s technical infrastructure aims to be as open and collaborative as the sites it powers. Most of the configuration of our servers is publicly accessible, and the Wikimedia Labs initiative allows contributors to test and submit improvements to the sites’ configuration files.

The Wikimedia Foundation currently operates a total of about 885 servers, and serves about 20 billion page views a month, on a non-profit budget that relies almost entirely on donations from readers.

Guillaume Paumier
Technical Communications Manager

 

We are donating servers

The Wikimedia Foundation has been upgrading and adding new servers to increase capacity and performance and to meet the demands of our users. As part of this process, we are soliciting requests from other non-profit organizations that would like to acquire some of our older systems, which can no longer keep up with that demand. These servers have been used by the Wikimedia Foundation on different projects for over 3 years and are no longer covered by their warranty. They are still good systems, though: while we may overload them and need replacements, they are more than suitable for many non-profits to use.

Close-up on a rack of Wikimedia servers in our datacenter in Ashburn, Virginia

We’re donating some of our older servers to like-minded US non-profits, who can apply by e-mail to show their interest.

Most systems (but possibly not all) have the following specifications:

  • Dual 2.5 GHz Intel(R) Xeon(R) CPUs (some may have AMD processors)
  • 2-8 GB RAM
  • Most servers have multiple HDDs
  • A majority are manufactured by Dell
  • Should work fine, but not guaranteed (see Disclaimers)

Disclaimers: The Wikimedia Foundation does not guarantee the operation or use of these servers in any shape or form. They are old; some may have dying fans, bad HDD sectors, and the like. Servers have been wiped of information and came through that process, but we make no promises about their function! Also, most servers have rails, but occasionally one may not, and we do not sort through them for these things; however, most are standard Dell 1U servers and getting replacement rails is fairly simple. Some servers are well over 3 years old. We do not simply turn off servers when they hit the 3-year mark; we turn them off when they are no longer worth using in any role or function on our cluster in a reliable manner. In most cases, it is simply that hardware technology has advanced to the point that a new server is much faster, and since we demand high performance of our servers, it is worth upgrading for our needs.

At this time we are only able to donate these servers to U.S.-based non-profits whose core values are similar to or in support of our own. This means we do not donate them for individual use. Since these servers were purchased with donations to support the Wikimedia Foundation, we feel we need to further donate them to other like-minded organizations, since that is how the money for the servers was meant to be spent. This means that we cannot, in good conscience, donate these servers for profit or personal use to individuals or corporations.

If you are a US organization and you would like to receive some of these servers for your NON-PROFIT use, please email servers@wikimedia.org; applications should only be made via email. Applications in this post’s comments will NOT be accepted. TO BE ELIGIBLE, YOUR EMAIL MUST INCLUDE THE FOLLOWING:

  • Subject: Server Decommissioning Donations for <NONPROFIT NAME HERE>
  • Your name, contact information, relationship with non-profit requesting the servers
  • Registered non-profit name and proof of status
  • Organization’s website
  • Information on the non-profit: who they are, and what their mission statement and goals are.
  • The shipping address where the servers should be delivered (via FedEx Ground).
  • How the servers will be used. (We like to know and share with folks!)

Please keep in mind that deciding where these go is pretty tough, so the more detailed you can be in your email, the better. (For example, ‘Wikimedia uses these for our sites.’ is pretty vague, whereas ‘Wikimedia is the non-profit foundation that runs Wikipedia. Server donations to us would be used to run our websites that allow access to Wikipedia and its sister projects.’ is a lot nicer. ;) ) Also, by submitting and possibly accepting servers from us, you are giving us permission to post about it here on our technical blog.

The submission period will remain open for no less than two weeks from this posting.

Chris Johnson
Operations Engineer