Wikimedia blog

News from the Wikimedia Foundation and about the Wikimedia movement

Operations

Wikipedia Adopts MariaDB

This past Wednesday marked a milestone in the evolution of Wikimedia’s database infrastructure: the completion of the migration of the English and German Wikipedias, as well as Wikidata, to MariaDB 5.5.

For the last several years, we’ve been operating the Facebook fork of MySQL 5.1 with most of our production environment running a build of r3753. We’ve been pleased with its performance; Facebook’s MySQL team contains some of the finest database engineers in the industry and they’ve done much to advance the open source MySQL ecosystem.

That said, MariaDB’s optimizer enhancements, together with the feature sets of Percona’s XtraDB and Oracle’s MySQL 5.5, provide compelling reasons to consider upgrading. (Many XtraDB features overlap with the Facebook patch, but I particularly like add-ons such as the ability to save the buffer pool LRU list, which avoids costly warmups on new servers.) Equally important, as supporters of the free culture movement, the Wikimedia Foundation strongly prefers free software projects; that includes a preference for projects without bifurcated code bases between differently licensed free and enterprise editions. We welcome and support the MariaDB Foundation as a not-for-profit steward of the free and open MySQL-related database community.

Preparing For Change

Major version upgrades of a production database are not to be undertaken lightly. In fact, as late as 2011, some Wikipedia languages were still running a heavily patched version of MySQL 4.0; the migration to 5.1 required both schema changes and direct modifications of data dumps to alter the padding of binary-typed columns. MySQL 5.5 contains a variety of incompatibilities with prior versions, thanks in part to better compliance with SQL standards. Changes to the query optimizer between versions may also change the execution plan for common queries, sometimes for the better but, historically, sometimes not. SQL behavior changes may result in replication breakage or data consistency issues, while performance regressions, whether from query plan or other changes, can cause site outages. This calls for a lot of testing.

Compatibility testing was accomplished by running MariaDB replicas outside of production, watching for replication errors, replaying production read queries and validating results. After identifying and fixing a couple of MediaWiki issues that surfaced as replication errors (along the lines of trying to set unsigned integer types to negative values, which previously caused a wrap-around instead of an error), we replayed production read queries using pt-upgrade from Percona Toolkit. pt-upgrade replays a query log against two servers and compares the responses for variances or errors. Scripts originally developed for our recent datacenter migration, which simultaneously warm up many standby databases from current production read traffic, helped with rough load testing and benchmarking. Along the way, a pair of bugs in MariaDB 5.5.28 and 5.5.29 was identified, one of which was a rare but potentially severe performance regression related to a new query optimizer feature. The MariaDB team was very responsive and quick to offer solutions, complete with test cases.
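To make the replay-and-compare idea concrete, here is a minimal Python sketch. It is not how pt-upgrade is implemented, and the real tool does considerably more; it only illustrates running the same read queries against two backends and flagging any query whose rows or errors differ. The names `compare_servers`, `run_query` and `make_backend` are made up for illustration, and in-memory SQLite databases stand in for the two MySQL-compatible servers.

```python
import sqlite3
from collections import Counter


def run_query(executor, query):
    """Run one query and normalise the outcome so two servers can be compared."""
    try:
        rows = executor(query)
    except Exception as exc:  # an error is itself a result worth comparing
        return ("error", type(exc).__name__)
    # Compare rows as a multiset: row order is not guaranteed without ORDER BY.
    return ("rows", Counter(tuple(row) for row in rows))


def compare_servers(queries, executor_a, executor_b):
    """Replay each query on both backends and return the ones that disagree."""
    differences = []
    for query in queries:
        outcome_a = run_query(executor_a, query)
        outcome_b = run_query(executor_b, query)
        if outcome_a != outcome_b:
            differences.append((query, outcome_a, outcome_b))
    return differences


def make_backend(rows):
    """Hypothetical stand-in for a database server: an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE page (id INTEGER, title TEXT)")
    conn.executemany("INSERT INTO page VALUES (?, ?)", rows)
    return lambda query: conn.execute(query).fetchall()


if __name__ == "__main__":
    old_server = make_backend([(1, "Main_Page"), (2, "MariaDB")])
    new_server = make_backend([(1, "Main_Page"), (2, "mariadb")])
    log = ["SELECT id FROM page", "SELECT title FROM page"]
    for query, old, new in compare_servers(log, old_server, new_server):
        print("MISMATCH:", query)
```

A query that fails in the same way on both backends is treated as matching, since the aim is to spot behavior that changed between versions rather than queries that were already broken.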

Performance Testing In Production

As a read-heavy site, Wikipedia aggressively uses edge caching. Approximately 90% of pageviews are served entirely from the edge while at the application layer, we utilize both memcached and redis in addition to MySQL. Despite that, the MySQL databases serving English Wikipedia alone reach a daily peak of ~50k queries/second. Most are read queries served by load-balanced slaves, depending on consistency requirements. 80% of the English Wikipedia query load (up to 40k qps) is typically handled by just two database servers at any given time. Our most common query type (40% of all) has a median execution time of ~0.2ms and a 95th percentile time of ~50ms. To successfully use MariaDB in production, we need it to keep up with the level of performance obtained from Facebook’s MySQL fork, and to behave consistently as traffic patterns change.

Ishmael views of pt-query-digest data collected via tcpdump for the most common Wikipedia read queries (pdf). The first page of a query shows data from db1042, running mysql-facebook-r3753, the second from db1043 over the same time period, running MariaDB 5.5.30.

Once confident that application compatibility issues were solved and comfortable with performance obtained under benchmark conditions, it was time to test in production. One of the production read slaves from the English Wikipedia shard was taken out of rotation, upgraded to MariaDB 5.5.30, and then returned for warmup. The load balancer weight was then gradually increased until it and a server still running MySQL 5.1-facebook-r3753 were equally weighted and receiving most of the query load.
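The gradual weight increase can be pictured with a small Python sketch. This is an illustration only, not MediaWiki's load balancer: it assumes a simple weighted random choice between servers, and the weight values, step count and the helper names `ramp` and `pick_server` are invented for the example. The host names are borrowed from the figure caption above.

```python
import random


def ramp(start, target, steps):
    """Yield `steps` evenly spaced weights, ending exactly at `target`."""
    for i in range(1, steps + 1):
        yield start + (target - start) * i / steps


def pick_server(weights, rng=random):
    """Choose a server name with probability proportional to its weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


if __name__ == "__main__":
    # Hypothetical weights: db1042 keeps its full weight while db1043,
    # the freshly upgraded replica, is ramped up from zero.
    weights = {"db1042": 100.0, "db1043": 0.0}
    for new_weight in ramp(0.0, 100.0, steps=4):
        weights["db1043"] = new_weight
        sample = [pick_server(weights) for _ in range(10_000)]
        share = sample.count("db1043") / len(sample)
        print(f"weight={new_weight:5.1f}  observed share={share:.2f}")
```

Raising the weight in steps lets the upgraded server's caches warm up under a small share of traffic before it carries an equal load.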

Also from the Percona Toolkit, we use pt-query-digest across all database servers to collect query performance data, which is then stored in a centralized database. Query data is collected from two sources per server and stored in separate buckets: the slow query log, which only captures queries exceeding 450ms, and periodic brief samples of all queries obtained by tcpdump. Ishmael provides a convenient way to visualize and inspect query digest data over time. Using it, along with direct analysis of the raw data, allowed us to validate that every query continued to perform within acceptable bounds.

For our most common query type, 95th percentile times over an 8-hour period dropped from 56ms to 43ms and the average from 15.4ms to 12.7ms. 50th percentile times remained a bit better with the 5.1-facebook build over the sample period, 0.185ms vs. 0.194ms. Many query types were 4-15% faster with MariaDB 5.5.30 under production load, a few were 5% slower, and nothing appeared aberrant beyond those bounds.
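For readers less used to percentile figures, the following Python sketch shows one common way to compute them (the nearest-rank method) and how to express a before/after difference as a relative change. It is an illustration under assumptions: pt-query-digest may compute its percentiles differently, and the sample latencies below are invented, not production data.

```python
import math


def nearest_rank_percentile(samples, pct):
    """Return the smallest sample such that at least pct% of samples are <= it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def relative_change(before, after):
    """Fractional change from `before` to `after` (negative means faster)."""
    return (after - before) / before


if __name__ == "__main__":
    # Invented latencies in ms: most queries are fast, a minority are slow.
    latencies = [0.2] * 90 + [50.0] * 10
    print("p50:", nearest_rank_percentile(latencies, 50))
    print("p95:", nearest_rank_percentile(latencies, 95))
    print("mean:", sum(latencies) / len(latencies))
    # Figures quoted in the post for the most common query type.
    print("p95 change:  {:+.1%}".format(relative_change(56.0, 43.0)))
    print("mean change: {:+.1%}".format(relative_change(15.4, 12.7)))
```

A skewed distribution like the invented one explains how a sub-millisecond median can coexist with a much larger average and 95th percentile. Applied to the numbers above, the 95th percentile improved by roughly 23% and the average by roughly 18%.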

From there, we upgraded the remaining slaves one by one, before finally rotating in a newer upgraded class of servers to act as masters. The switch was seamless and performance continues to look good. We’ll be completing the migration of shards covering the rest of our projects over the next month. Beyond that, we’re looking forward to the future release of MariaDB 10 (global transaction IDs!), and are continually assessing ways to improve our data storage infrastructure. If you’re interested in helping, the Wikimedia Foundation is hiring!

Asher Feldman, Site Architect

How the Technical Operations team stops problems in their tracks

Last week, you read about how Wikimedia Foundation’s Technical Operations team (“Ops”) spent hundreds or thousands of staff hours to refactor and automate all the services it provides, to prepare for the January data center migration. One reward from that work: our sites were not down as often, and when they were, downtime was for better reasons.

“Another thing that illustrates our growth and maturity is our downtime,” says Operations engineer Peter Youngmeister. “Something that’s less visible to people outside of Ops is the kind of downtime we have. For example, we no longer have much downtime of the variety of ‘Oops, bumped that cable’ or ‘That one box died,’ because things are much more robust now, much more redundant. A lot of that is a product of the massive automation push we’ve been going through, which lets us create redundancy far more easily, and lets us spend our time not fighting fires.”

Wikimedia Foundation engineer Roan Kattouw adds: “Or, ‘the master DB server has a full disk’ — that one happened a few times a few years ago, and doesn’t happen any more now.”

To fix crises fast, we need monitoring: tools that automatically check for problems and alert our engineers when something is broken. In the very early days of our sites, we simply trusted that there would usually be a sysadmin online and available in case someone noticed a problem and complained on IRC. Several years ago, we began to use Nagios for monitoring and assigned a “pager duty” rotation to decide who might be woken up by a crisis.

Nagios runs coarse automated tests on the behavior of our site (such as “Does port 80 return an HTTP 301?”) and checks certain key numbers to make sure they’re within the desired range (for instance, to test whether we’re running out of memory). If a test fails, Nagios sends out email, IRC, and SMS alarms.
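The two kinds of test described above can be sketched in Python as follows. The exit-code convention (0 for OK, 1 for WARNING, 2 for CRITICAL, 3 for UNKNOWN) is the standard one for Nagios plugins; everything else, including the function names and the threshold values, is an assumption made for illustration and not taken from Wikimedia's actual checks. Fetching the HTTP status or the memory figure is deliberately left out so the sketch stays self-contained.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def check_http_status(status, expected=301):
    """Behavior test: did the server answer with the expected HTTP status?"""
    if status is None:  # e.g. the probe could not connect at all
        return UNKNOWN
    return OK if status == expected else CRITICAL


def check_threshold(value, warn, crit):
    """Range test: compare a measured number against warning/critical limits."""
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


if __name__ == "__main__":
    # Hypothetical readings: an HTTP probe result and memory use in percent.
    results = {
        "http_port_80": check_http_status(301),
        "memory_used_pct": check_threshold(93.5, warn=85, crit=95),
    }
    for name, code in results.items():
        print(f"{name}: {LABELS[code]}")
```

In a real plugin the process would exit with the returned code, which is how the monitoring system decides whether to raise an alarm.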

Monitoring helps us address the crisis faster, but it often doesn’t help with the actual problem-solving.

“Nagios is great for telling you when things are broken, and crap for telling you why,” Peter explains. “The work that Asher Feldman has done creating profiling data is more useful.”

Monitoring our servers (here in Ashburn, Virginia) helps to minimize outages and service disruptions.

As Roan puts it: “Profiling is the act of generating data on ‘How much time does large task X spend doing small subtask Y?’ The reason for that is that 1) one of those small Ys might actually be not so small, and be a problem, and 2) per the 80-20 rule, for some Ys, optimization will have a larger impact, so you wanna find those.” Profiling generates knowledge about the behavior of our systems, so that engineers can better understand how the cluster should be operating, and offers data points for troubleshooting.
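Roan's definition translates into very little code. The sketch below is a generic Python illustration, not the profiler used on the Wikimedia cluster: a hypothetical `Profiler` class accumulates wall-clock time per named subtask and reports each subtask's share of the total, which is the number you would look at to find the Ys worth optimizing.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class Profiler:
    """Accumulate elapsed time per named subtask (illustrative only)."""

    def __init__(self, clock=time.perf_counter):
        self.clock = clock
        self.totals = defaultdict(float)

    @contextmanager
    def section(self, name):
        """Time the enclosed block and add it to the subtask's running total."""
        start = self.clock()
        try:
            yield
        finally:
            self.totals[name] += self.clock() - start

    def shares(self):
        """Return each subtask's fraction of all recorded time."""
        total = sum(self.totals.values())
        if not total:
            return {}
        return {name: spent / total for name, spent in self.totals.items()}


if __name__ == "__main__":
    profiler = Profiler()
    # Hypothetical subtasks of one page render.
    with profiler.section("database"):
        time.sleep(0.02)
    with profiler.section("template"):
        time.sleep(0.005)
    for name, share in sorted(profiler.shares().items(), key=lambda kv: -kv[1]):
        print(f"{name:10s} {share:.0%}")
```

Sections are assumed not to nest; nested sections would count the inner time twice in the shares.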

We use two profiling systems to get time-series performance data: Ganglia at the “host” level, and Graphite at the “application” level (get a Labs login to see Graphite). In the past two years, we’ve configured Ganglia to cover much more data, and in 2012 began to use Graphite. The better data makes both tools more useful for troubleshooting, and Director of Operations CT Woo regularly checks the dashboard to look out for upcoming problems and alert his team. This reduces downtime.

For example, on one Ganglia page, we previously only had access to host data: free disk, load, etc. We have recently added Apache-specific data, such as requests per second and number of idle threads. This additional information aids sysadmins in troubleshooting. “One can look at it and make better deductions than just ‘Yup, server’s under a lot of load…’,” explains Peter.

Like puppetization, improvements in profiling were an investment by the Ops team. “There’s a plug-in for Ganglia that does Apache performance stats. It took me a couple of hours to set it all up. But, again, that’s being forward-thinking, debt that we had to work off instead of just cursing ourselves when it wasn’t there when we needed it. It’s a massive undertaking to decide to do things The Right Way, set up a platform, instead of doing a million one-offs.”

While puppetizing and improving monitoring and profiling to prepare for the data center migration, the Operations team had to defer other non-urgent work. “Ops was less able to give support to many teams,” says Peter. “For example, Fundraising just had a couple of boxes and could do whatever they wanted on them, as opposed to now where [Operations Engineer] Jeff Green is working on making an awesome, PCI-compliant system with them full time. Or, Analytics was very independent/unsupported, because there were so little human-hours to give to supporting things that weren’t just keeping the site up… I think that the EQIAD [Virginia Data center] build-out is very demonstrative of the amount of [technical] debt that Ops was in.”

Now, Peter is looking forward to seeing Wikimedia “spin up more data centers dramatically more quickly.” The Operations team is making preparations for an additional data center on North America’s west coast. Site Architect Asher Feldman sees a “continuing arc of refinement” in the team’s future, rather than “challenges that end, to be replaced by new ones.” “The challenges of making MediaWiki scale aren’t going to go away any time soon; nor will the need for incremental architecture modernization at multiple levels.” For instance, Ops needs to continue puppetizing certain services; some modules also need their Puppet manifests tweaked so that they work not just on the main site, but also in Wikimedia Labs.

You can check out the Operations team’s 2012–2013 goals to find out more about what’s next (including improvements in search and security).

Sumana Harihareswara, Engineering Community Manager

From duct tape to puppets: How a new data center became an opportunity to do things right

Last week, the Wikimedia Foundation flipped a historic switch: we transitioned our main technical services to a shiny new data center in Ashburn, Virginia. For the first time since 2004, Wikimedia sites are no longer primarily hosted in Tampa, Florida.

Peter Youngmeister works in the Wikimedia Foundation's Technical Operations team.

To help understand this grueling journey (and why it’s crucial), look through the eyes of Wikimedia Foundation engineer Peter Youngmeister. Peter joined the Wikimedia Foundation’s Technical Operations team (“Ops”) about two years ago, in March 2011. At the time, “the team” meant “about six engineers supporting the fifth-most visited site on the Web,” said Peter. The Foundation has now increased its Ops team to 14, and has several job openings.

“This also meant that out of the fast/cheap/well triangle, we’d gone with fast and cheap,” Peter recalled. We made quick-and-dirty solutions because problems had to be solved immediately. “With so few Ops engineers, you’re always playing catchup; long-term is hard.” He said that the digital infrastructure when he arrived was “kinda like many many layers of really artfully applied duct tape.”

And the biggest, most pressing flaw: Wikimedia only had one fully functional primary data center, in Tampa, Florida. If something catastrophic happened to Tampa, all the sites would go down until new servers could be brought online and data recovered from backup. So the Ops team chose a new data center location, in Ashburn, Virginia, and started preparing to integrate it into our infrastructure. But the preparation of EQIAD, which began in 2011, turned out to require much more work than the Operations and Platform engineering teams had foreseen.

We had never set up a data center of this complexity from scratch before. The systems in Tampa were “layers of duct tape that had been built up over years… Our first problem was that, for example, very little was in Puppet,” Peter said. To configure the Wikimedia servers, we use Puppet, a configuration management system, which lets us write code (Puppet “manifests”) that manages all of our servers like a single large application (and more easily track, troubleshoot, and revert changes).

Since the new data center would exactly mirror the old one, leveraging the power of Puppet to keep our configurations in sync would be crucial. But since our infrastructure included dozens of services that weren’t in Puppet yet, we had to examine each of their configurations to “puppetize” them. And in early 2011, Peter noted, “our whole search infrastructure existed outside of Puppet control. Our Puppet manifests for our databases were a file that just had a comment that said ‘domas is a slacker.’”

In short, Wikimedia needed not only to replicate the functionality that had been incrementally added over ten years, but to refactor it into an automatable form so that the third, fourth, etc. replications would be far easier. So, in addition to the Ops team’s day-to-day responsibilities for site maintenance and crisis management, the Ops and Platform teams needed to find hundreds or thousands of staff-hours to refactor, automate and add monitoring to all the services they provided. We aren’t done yet with our “mass puppetization” investment, which we’ve been working on for at least two years.

The core application (MediaWiki) is only one of the myriad moving parts that needed attention; over the past two years, we’ve puppetized and strengthened databases, search, fundraising code, logging and analytics tools, caches, the Nagios monitoring software and dozens of other services. Take search as an example: several years ago, the Wikimedia Foundation used one search server to cover nearly all the wikis other than English Wikipedia — a dangerous single point of failure. Peter arrived at the Foundation and found that none of the search infrastructure was puppetized. After he worked significantly on search, as of November 2012, he noted we had “two fully independent search setups, one in each data center. Fail-over takes a couple of minutes at most.”

Puppet Tutorial: Video from the Wikimedia Foundation tech days, September 11, 2012, explaining Puppet configuration management in the context of Wikimedia’s site/services infrastructure. Speaker/slides: Ryan Lane.

Puppetizing the configuration files, and using Gerrit to manage code review and approval also gave us better transparency and helped staff and volunteers collaborate better on improvements, maintenance and troubleshooting. Anyone can see how our servers are configured, read the Puppet configuration “manifests,” propose new changes and view and comment on pending proposals.

In contrast, “when I got here, everything was done on a local Subversion repository or our puppetmaster, and then pushed out from there, which kinda works if you have 6 or fewer people,” Peter said. (The Puppetmaster is the master repository that instructs all the other boxes in the cluster to update their manifests, and thus updates their packages and configurations.) To keep track of configuration changes, people simply used an IRC bot to log summaries of their actions to the server admin log, which made it hard to revert changes or help train new teammates. “But also, when the Ops team is only 6 people, and everyone has been around for years, everyone just knows all the parts,” he explained.

As they created the 700+ hostclasses currently defined in Puppet, Operations engineers moved towards treating our infrastructure as a codebase, and thus from pure systems administration towards a DevOps approach. As of November 2012, “we’re very nearly at a point where we can manage our whole infrastructure without needing to log into hosts, which is the whole goal,” Peter said with a smile. Logging into hosts is a bad thing “because it means that you’re doing things by hand and/or that what you’re doing isn’t going through code review. Moving to Gerrit for our Puppet repos is awesome: It means I can really easily see what my coworkers are doing. I can ask for review when needed. It’s a huge sign of maturation of our department.”

Their years of work have led to a nearly painless data center migration, but the work also began paying off immediately with reduced downtime. You’ll read more about that in the second part of this story next week.

Sumana Harihareswara, Engineering Community Manager

Wikimedia sites to move to primary data center in Ashburn, Virginia

(Update on January 22nd, 2013, 20:00 (UTC): Our Operations team considers the migration to be over. Major disruption is no longer expected.)

All Wikimedia sites, including Wikipedia, may encounter temporary interruptions on January 22–24, as they transition to servers in a new data center in Ashburn, Virginia (see more photos).

Next week, the Wikimedia Foundation will transition its main technical operations to a new data center in Ashburn, Virginia, USA. This is intended to improve the technical performance and reliability of all Wikimedia sites, including Wikipedia.

Engineering teams have been preparing for the migration to minimize inconvenience to our users, but major service disruption is still expected during the transition. Our sites will be in read-only mode for some time, and may be intermittently inaccessible. Users are advised to be patient during those interruptions, and share information in case of continued outage or loss of functionality.

The current target windows for the migration are January 22nd, 23rd and 24th, 2013, from 17:00 to 01:00 UTC (see other timezones on timeanddate.com).

Wikimedia sites have been hosted in our main data center in Tampa, Florida, since 2004; before that, the couple of servers powering Wikipedia were in San Diego, California. Ashburn is the third and newest primary data center to host Wikimedia sites.

A major reason for choosing Tampa, Florida as the location of the primary data center in 2004 was its proximity to founder Jimmy Wales’ home, at a time when he was much more involved in the technical operations of the site. In 2009, the Wikimedia Foundation’s Technical Operations team started to look for other locations with better network connectivity and more clement weather. Located in the Washington, D.C. metropolitan area, Ashburn offers faster and more reliable connectivity than Tampa, and usually fewer hurricanes.

The Operations team started to plan and prepare for the Virginia data center in summer 2010. The actual build-out and racking of servers at the colocation facility started in February 2011, and was followed by a long period of hardware, system and software configuration. Traffic started to be served to users from the Ashburn data center in November 2011, in the form of CSS and JavaScript assets (served from “bits.wikimedia.org”).

We reached a major milestone in February 2012, when caching servers were set up to handle read-only requests for Wikipedia and Wikimedia content, which represent most of the traffic to Wikipedia and its sister sites. In April 2012, the Ashburn data center also started to serve media files (from “upload.wikimedia.org”).

Cacheable requests represent about 90 percent of our traffic, leaving 10 percent that requires interaction with our web (Apache) and database (MySQL) servers, which are still being hosted in Tampa. Until now, every edit made to a Wikipedia page has been handled by the servers in Tampa. This dependency on our Tampa data center was responsible for the site outage in August 2012, when a fiber cut severed the connection between our two locations.

Starting next week, the new servers in Ashburn will take on that role as well, and all our sites will be able to function fully without relying on the servers in Florida. The legacy data center in Tampa will continue to be maintained, and will serve as a secondary “hot failover” data center: servers will be in standby mode to take over, should the primary site experience an outage. Server configuration and data will be synchronized between the two locations to ensure a transition as smooth as possible in case of technical difficulties in Ashburn.

Besides just installing newer hardware, setting up the data center in Ashburn has also been an opportunity for architecture overhauls, like incremental improvements of the text storage system, and the move to an entirely new media storage system to keep up with the growth of the content generated and curated by our contributors.

Wikimedia’s technical infrastructure aims to be as open and collaborative as the sites it powers. Most of the configuration of our servers is publicly accessible, and the Wikimedia Labs initiative allows contributors to test and submit improvements to the sites’ configuration files.

The Wikimedia Foundation currently operates a total of about 885 servers, and serves about 20 billion page views a month, on a non-profit budget that relies almost entirely on donations from readers.

Guillaume Paumier
Technical Communications Manager

We are donating servers

The Wikimedia Foundation has been upgrading and adding servers to increase capacity and performance and meet the demands of our users. As a result, we are soliciting requests from other non-profit organizations that would like to acquire some of our older systems that are no longer able to keep up with demand. These servers have been used by the Wikimedia Foundation on different projects for over 3 years and are no longer covered by their warranty. These are good systems though, and while we may overload them and need replacements, they are more than suitable for many non-profits to use.

We’re donating some of our older servers to like-minded US non-profits, who can apply by e-mail to show their interest.

Most systems (but possibly not all) have the following specifications:

  • Dual 2.5 GHz Intel(R) Xeon(R) CPUs (some may have AMD processors)
  • 2-8GB RAM
  • Most servers have multiple HDD
  • A majority are manufactured by Dell
  • Should work fine but not guaranteed (see Disclaimers)

Disclaimers: The Wikimedia Foundation does not guarantee the operation or use of these servers in any shape or form. They are old, and some may have dying fans, bad HDD sectors, and the like. Servers have been wiped of information and they made it through that process, but no promises on function! Also, most servers have rails, but occasionally one may not, and we do not sort through them for these things. However, most are standard Dell 1U servers and getting replacement rails is fairly simple. Some servers are well over 3 years old: we do not just turn off servers when they hit the 3-year mark; we turn them off when they can no longer be used reliably in any role or function on our cluster. In most cases, hardware technology has simply advanced to the point that a new server is much faster, and since we demand high performance of our servers, it is worth upgrading for our needs.

At this time we are only able to donate these servers to U.S.-based non-profits whose core values are similar to or in support of our own. This means we do not donate them for individual use. Since these servers were purchased with donations to support the Wikimedia Foundation, we feel we need to further donate them to other like-minded organizations, since that is how the money for the servers was meant to be spent. This means that we cannot, in good conscience, donate these servers for profit or personal use to individuals or corporations.

If you are a US organization and you would like to receive some of these servers for your NON-PROFIT use, please email servers@wikimedia.org. Applications should only be made via email; applications in this post’s comments will NOT be accepted. TO BE ELIGIBLE, YOUR EMAIL MUST INCLUDE THE FOLLOWING:

  • Subject: Server Decommissioning Donations for <NONPROFIT NAME HERE>
  • Your name, contact information, relationship with non-profit requesting the servers
  • Registered non-profit name and proof of status
  • Organization’s website
  • Information on the non-profit, who they are, what their mission statement and goals are.
  • Shipping address information for a FedEx Ground delivery to where the servers need to go.
  • How the servers will be used. (We like to know and share with folks!)

Please keep in mind that deciding where these go is pretty tough, so the more detailed your email, the better. (For example, ‘Wikimedia uses these for our sites.’ is pretty vague, whereas ‘Wikimedia is the non-profit foundation that runs Wikipedia. Server donations to us would be used to run our websites that allow access to Wikipedia and its sister projects.’ is a lot nicer. ;) ) Also, by submitting a request and possibly accepting servers from us, you are giving us permission to post about it here on our technical blog.

The submission period will remain open no less than two weeks from this posting.

Chris Johnson
Operations Engineer

Recovery of broken Gerrit repositories

As some of you may have noticed, yesterday our engineering team discovered that 16 of our Gerrit repositories were very badly broken. Their branches and tags all seemed to have vanished, along with their configuration (this is stored in a special branch on the repository itself). All of the repositories except one have been restored to their state as of about midnight UTC on Thursday, September 6. What follows is an in-depth analysis of what happened and how I fixed it, along with some commentary about what I learned along the way.

(more…)

Wikimedia site outage, 6 August, 2012

Wikimedia sites experienced an outage today that started at about 6:15am PDT (13:15 UTC). Except for the mobile site, the sites were brought back up by 7:18am PDT (14:18 UTC). Mobile site services resumed at about 8:35am PDT (15:35 UTC).

At about 6:15am PDT, we were alerted to a site issue and our team found severed network connectivity between our two data centers. When we checked with our network provider, they informed us that the outage was caused by a fiber cut between the two data centers.

The data centers — one in Ashburn, Virginia and the other in Tampa, Florida — are connected by two separate fiber links (for redundancy). While Ashburn serves most of the traffic, it needs to talk to our Tampa data center for backend services (e.g. database).

We do operate two separate 10G fibers between the data centers. We are now working with our network provider to determine how and why we were impacted by that fiber cut when we are supposed to have redundancy in our network. We are still waiting for their full report.

The team worked around the outage by rerouting traffic to Tampa, bypassing the Ashburn site. Connectivity was restored at about 8:35am PDT to one of the provider’s network links. The second link was restored at about 11:30am PDT (18:30 UTC). However, we will not switch traffic back to Ashburn until we are comfortable with their fix. The switch back to Ashburn from Tampa should not be apparent to users.

UPDATE: Expanded report posted here: http://wikitech.wikimedia.org/view/Site_issue_Aug_6_2012

Please see status.wikimedia.org for site availability.

CT Woo, Director of Technical Operations

Search restored after leap second bug

At midnight UTC on July 1, Wikimedia’s search cluster stopped working. A “leap second” inserted by the NTP daemon at that time caused Java processes to lock up, including our Lucene search system. The same bug affected many other websites. Our engineers restored service in less than two hours.

Leap seconds are added to our clocks once every few years to keep clock time (UTC) in step with solar time, so that, on average, the sun crosses the meridian of the Royal Observatory in Greenwich at precisely 12:00. Some people believe that the desire to keep these two time standards synchronised is anachronistic, and that it would be better to let them drift apart for 600 years and then add a single “leap hour”. I’m sure many computer engineers would breathe a sigh of relief if such a change were implemented.

Tim Starling, Lead Platform Architect

DigiCert partnership enhances SSL security on Wikimedia sites

The Wikimedia Foundation today announced a partnership with DigiCert, Inc., based in Linden, Utah, to secure its web and mobile properties, using the company’s Enterprise SSL Managed PKI. The agreement supports online authentication and encryption on Wikimedia’s web and mobile properties, while enabling Foundation staff to streamline digital certificate management.

“The Wikimedia Foundation is grateful for this partnership with DigiCert, which will enhance our ability to secure the millions of online exchanges that occur with our websites each day,” said CT Woo, Director of Technical Operations for the Wikimedia Foundation. “It’s important for Wikimedia to identify like-minded partners that value transparency and the privacy of our users.”

DigiCert is an online security provider for many of the most recognized companies and web sites in the world, including four of the top 10 comScore-ranked sites. With 489 million unique visitors to the 285 language Wikipedias and sister sites each month, the Wikimedia Foundation seeks partners who share its mission to ensure transparency and privacy for its users.

DigiCert has seen consistent growth over time and is currently the world’s third-largest provider of enterprise authentication services and digital certificates, with numerous government, educational and business clients around the world.

“DigiCert is pleased to partner with the Wikimedia Foundation in recognizing the importance of the free and secure flow of information across the Internet and to support the Foundation’s mission,” said DigiCert CEO Nicholas Hales in a press release. “We’re excited to have another opportunity to demonstrate the quality, scalability and flexibility of DigiCert’s products for a continually expanding roster of globally leading organizations of all sizes and industries.”

CT Woo, Director of Technical Operations

Opening our operations with Wikimedia Labs

For the past year and a half we’ve been working on a project named Wikimedia Labs, which enables us to invite our community to contribute to how our sites are run. Labs is a cloud computing environment using OpenStack for development, testing and deployment of Wikimedia’s infrastructure as a whole, enabling us to treat our infrastructure as an open source software project.

The problems we’re solving

When Wikipedia and its sister projects started, volunteers had root level access on our infrastructure. They were the only roots and most of the infrastructure they built is still in use today. Our lenient access policy made us flexible, so changes could happen quickly. Also, the sites were smaller, had far fewer users, and large, fundamental changes could be made in production.

Growth has made us less willing to give out root access to volunteers. Because of the size of our sites, downtime is less acceptable. But having fewer volunteers means we have fewer ideas, and as a result, our ability to make changes quickly is reduced. We haven’t had a new volunteer root in years. We haven’t even had a new volunteer with shell access. Engaging volunteers and enabling them to easily contribute is a wider problem as well.

Our software development community scales with volunteers. Unfortunately, operations doesn’t scale in a similar way right now. We’re limited to the staff operations engineers we currently have. The staff is great, but the fact that operations can’t scale to meet the needs of a large growth of developers means that operations is a bottleneck. Furthermore, our access policy prevents volunteer developers from learning how our infrastructure works.

This leads to a situation where our staff developers and volunteer developers can’t easily collaborate. Our volunteers also have no way of appropriately testing their changes, since our infrastructure is complex and difficult to replicate. This means it’s harder to take contributions, which further slows the pace of changes on our sites.

(more…)