Wikimedia blog

News from inside the Wikimedia Foundation

Operations

Ever wondered how the Wikimedia servers are configured?

Well, wonder no longer! To configure the Wikimedia servers, we use Puppet, a configuration management system, which lets us write code that manages all of our servers like a single large application. Of course, to really know how our servers are configured, you’d need to see our Puppet configuration.

Good news: we’ve just released our Puppet configuration in a public Git repository.

What is and isn’t included

Nearly everything is included in the repository. We did spend a few weeks removing private and sensitive material, which we now keep in a private repository that is only available to Wikimedia staff and volunteers with root access.

This, of course, means that the Puppet configuration, as released, won’t completely work. The public repository makes references to files and manifests in the private repository. To make the repository work, you’ll need to fill in the missing information. There isn’t very much in the private repository, though, so that task should be fairly easy.

The point of making this repository public

We have a couple of reasons for making this repository public:

  1. It shares knowledge with the world
  2. It lets us treat operations like a software development project

Both reasons align with our mission, but we were already mostly sharing this knowledge via wikitech. The second reason aligns more closely with our mission, as it allows us to let the world be directly involved in our operations efforts.

Labs and community-oriented operations

The release of this Puppet repository is the first step in the Wikimedia Test/Dev Labs project. We’ll be going further than just making the repository readable by the world. Part of the Test/Dev Labs project is to create a clone of our production cluster. This clone will run a branch of the Puppet repository.

Staff and community developers, along with staff and community operations engineers, will be able to push changes to the test branch of the Puppet repository, which will manage the cloned cluster. They’ll then be able to submit these changes for review to the production branch of the Puppet repository. The staff operations engineers can then code-review the changes and push them out to the production systems.

Just as they can already edit the Wikimedia content, the site interface, and the site’s software (MediaWiki), community members will be able to edit the site’s architecture as well.

Accessing the repository

Since this is a public Git repository, you can do an anonymous git clone like so:

git clone https://gerrit.wikimedia.org/r/p/operations/puppet

You can browse the repository through the gitweb interface. You can see the code review activity via Gerrit.

Ryan Lane
Operations Engineer

Does your Wikipedia mobile app expect our full content layout?

If so, we have an upcoming change this week that you should be aware of. We’re in the final part of testing our new device detection, which will automatically redirect any mobile user agent we recognize to its corresponding .m mobile gateway. This means that if your app declares a mobile UA recognized by WURFL and connects directly to us, we will redirect that traffic to .m.wikipedia.org and NOT .wikipedia.org.

Apps that go through an intermediate gateway that doesn’t present a mobile user agent will not be affected. If, on the other hand, your app handles all of its own logic, you will need to explicitly identify your UA to us, or ensure that your UA contains “bot” to bypass redirection.
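
The redirect rule described above can be sketched roughly as follows. This is only an illustration in Python, not the actual gateway code: the MOBILE_TOKENS tuple is a hypothetical stand-in for the WURFL device database, and the function name and the host layout (en.wikipedia.org becoming en.m.wikipedia.org) are assumptions made for the example.

```python
from typing import Optional

# Hypothetical stand-in for the WURFL device database: a few substrings
# that commonly appear in mobile user agents. The real data is far larger.
MOBILE_TOKENS = ("iphone", "android", "blackberry", "opera mini")


def mobile_redirect_host(host: str, user_agent: str) -> Optional[str]:
    """Return the .m host to redirect to, or None if no redirect applies."""
    ua = user_agent.lower()
    # A UA containing "bot" bypasses redirection, as described in the post.
    if "bot" in ua:
        return None
    # Agents that are not recognized as mobile get the full site.
    if not any(token in ua for token in MOBILE_TOKENS):
        return None
    labels = host.split(".")
    # Already on the mobile gateway: nothing to do.
    if "m" in labels:
        return None
    # Assumed host layout: en.wikipedia.org -> en.m.wikipedia.org
    return ".".join([labels[0], "m"] + labels[1:])


print(mobile_redirect_host("en.wikipedia.org", "Mozilla/5.0 (iPhone; CPU iPhone OS 4_3)"))
print(mobile_redirect_host("en.wikipedia.org", "MyAppBot/1.0 (iPhone)"))
```

Under these assumptions, the first call yields the mobile host and the second yields None, because the user agent contains "bot".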

If this is not the behavior that you want, please let us know on meta or come find us on freenode in #wikimedia-mobile.

Tomasz Finc

Director of Mobile and Special Projects

Protocol relative URLs enabled on test.wikipedia.org

In preparation for enabling HTTPS on Wikimedia Foundation sites, we’ve recently enabled protocol relative URLs on test.wikipedia.org. Protocol relative URLs are needed to make the site work properly in both HTTP and HTTPS modes.

What are protocol relative URLs?

Normal URLs look like: http://test.wikipedia.org/wiki/Main_Page or https://test.wikipedia.org/wiki/Main_Page. Both of these URLs define the protocol that will be used. Protocol relative URLs look like this: //test.wikipedia.org/wiki/Main_Page. Dropping the protocol from the URL allows the browser to assign the current protocol to the URL. So, if you are visiting the site in HTTPS mode, links will point to HTTPS, and if you are visiting the site in HTTP mode, links will point to HTTP.
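
As a small illustration of the resolution step a browser performs, Python's standard urllib.parse.urljoin applies the same rule: a link that starts with // inherits the scheme of the page it appears on. The Sandbox page used as the base URL here is just a placeholder for the example.

```python
from urllib.parse import urljoin

link = "//test.wikipedia.org/wiki/Main_Page"

# The scheme is taken from the page the link appears on.
over_http = urljoin("http://test.wikipedia.org/wiki/Sandbox", link)
over_https = urljoin("https://test.wikipedia.org/wiki/Sandbox", link)

print(over_http)   # http://test.wikipedia.org/wiki/Main_Page
print(over_https)  # https://test.wikipedia.org/wiki/Main_Page
```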

Why are protocol relative URLs needed?

We need to use protocol relative URLs for a couple of reasons:

  1. All requests are served by our caching layer (squid or varnish). If you are browsing the site in HTTPS mode, and another user is browsing the same pages in HTTP mode, two versions of those pages will be stored in our cache, as the links are different between the two modes. This splits our cache, which makes it less efficient and more expensive to operate.
  2. When browsing in HTTPS mode, we want to ensure links point to the correct protocol. When pages are parsed, things like interwiki links are created by the parser. If we do not use protocol relative URLs, then links will point to either HTTPS or HTTP, which will cause users to switch modes randomly.
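
The cache-splitting effect in the first point can be shown with a toy model. This is a simplified sketch, not how squid or varnish actually store objects: the render function is a hypothetical one-link page renderer, and the "cache" is simply the set of distinct page bodies produced for HTTP and HTTPS readers.

```python
def render(scheme: str, protocol_relative: bool) -> str:
    """Toy page renderer: emits one link, absolute or protocol-relative."""
    prefix = "" if protocol_relative else scheme + ":"
    return '<a href="%s//test.wikipedia.org/wiki/Main_Page">Main Page</a>' % prefix


def cached_variants(protocol_relative: bool) -> int:
    """Count the distinct page bodies needed to serve HTTP and HTTPS readers."""
    bodies = {render(scheme, protocol_relative) for scheme in ("http", "https")}
    return len(bodies)


print(cached_variants(protocol_relative=False))  # 2
print(cached_variants(protocol_relative=True))   # 1
```

With absolute links the two modes produce different bodies, so two copies are stored; with protocol relative links the body is identical and one copy serves both.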

How does this affect me?

It shouldn’t. Things should continue to work as before. We are currently testing this out on some internal wikis, and have enabled it on test.wikipedia.org so that the entire community will have a couple weeks to test it out before we enable it on all projects.

API users, especially, should test thoroughly. The API, in most cases, will not output protocol-relative URLs, but will continue to output http:// URLs regardless of whether you call it over HTTP or HTTPS. This is because we don’t expect API clients to be able to resolve protocol relative URLs correctly, and because the context of these URLs (which is needed to resolve them) will frequently get lost along the way.

The exceptions to this are:

  • HTML produced by the parser will have protocol-relative URLs in <a href="…"> tags etc.
  • prop=extlinks and list=exturlusage will output URLs verbatim as they appear in the article, which means they may output protocol-relative URLs

If you are getting protocol-relative URLs in some other place in the API, that’s likely a bug.
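
For API clients that do receive protocol-relative URLs (for example from prop=extlinks), one possible way to cope is to attach an explicit scheme before storing or passing the URL on, since the context needed to resolve it may no longer be available later. The helper name absolutize and the sample URLs are hypothetical; this is a minimal sketch, not part of the MediaWiki API.

```python
def absolutize(url: str, scheme: str = "http") -> str:
    """Give a protocol-relative URL an explicit scheme; leave others alone."""
    if url.startswith("//"):
        return scheme + ":" + url
    return url


# Hypothetical values of the kind prop=extlinks might return.
extlinks = ["//example.org/page", "http://example.com/", "https://example.net/a"]
resolved = [absolutize(u, scheme="https") for u in extlinks]
print(resolved)
# ['https://example.org/page', 'http://example.com/', 'https://example.net/a']
```

The scheme is passed in explicitly because only the caller knows whether it fetched the data over HTTP or HTTPS.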

If you notice any issues related to protocol relative URLs, in the API or not, please let us know.

Note: we’ve also enabled HTTPS on test.wikipedia.org, so please do test protocol relative URLs in both HTTP and HTTPS modes. There is at least one known bug with regard to HTTPS mode and redirects, which will be fixed soon. More to come on this in a later post.

Ryan Lane

Server Decommission Donations

At this time we have closed submissions.  We have received well over 100 requests, and will not have enough servers to cover those, let alone more.  Thanks for the submissions!  ~ RobH @ 2011-06-18 @ 10:00 EST

Due to the overwhelming response, we will unfortunately not be able to reply to everyone on an individual basis.  If your organization is selected, you will receive an email from us indicating the approval, as well as shipping information. ~ RobH @ 2011-06-30 @ 13.30 EST

 

The Wikimedia Foundation has been upgrading and adding new servers to keep up with traffic demand and capacity growth, as we always do. Recently, we replaced some of our older servers with faster, higher-capacity, and more energy-efficient ones. The older servers are now decommissioned and will be donated. Do note that they are more than three years old and out of warranty. While we may have placed a lot of demand on them over the years, they are in fine working condition.

Most systems (but possibly not all) have the following specifications:

  • Dual CPU 2.5 GHz
  • From 3GB to 24GB of RAM, depending on role.
  • Most have 80 GB or larger HDD (some have two hard drives, some drives are 160GB or possibly even 250GB)

If you are interested, please provide the following information in your email to us:

  • Registered non-profit name and information.
  • Your contact information, including email address, phone number, and relationship with requesting non-profit.
  • Information on the non-profit, its charter, mission, and goals.
  • Shipping address information for a FedEx Ground delivery (i.e., the shipment destination)*
  • How the servers will be used.  (We like to know and share with folks!)

* At this time we regret that we are only able to ship servers to US-based non-profits.  This is due to the cost of shipping and the various export laws and taxes that apply when shipping internationally.

Please provide as much detail as possible on how you plan to use the servers. For example, ‘Wikimedia will use these for our sites.’ is pretty vague, whereas ‘Wikimedia is the non-profit foundation that runs Wikipedia. Server donations to us would be used to run our websites that allow access to Wikipedia and its sister projects.’ is much clearer.

If you are not a registered non-profit, you must use the server(s) in a way that works with or on the projects of the Wikimedia Foundation.  We are not donating these servers to private individuals for personal use.  All requests that are not for use on Wikimedia projects or are not going to a non-profit will be ignored.

By submitting and possibly accepting servers from us, you are granting the Wikimedia Foundation permission to publish details of the donation.  This is normally (but not limited to) a quick blurb about it on our Tech Blog (http://techblog.wikimedia.org).

The Wikimedia Foundation provides no guarantee of the hardware donated in any manner.  Any use of the hardware is not the responsibility of the Wikimedia Foundation.

All requests will be reviewed by our technical team, who will reply regarding server availability.  Please keep in mind that these requests are handled on a low-priority schedule, with our normal operations taking precedence.  There may be delays in shipping out your request, or we may simply run out of servers.

Rob Halsell

Wikimedia Operations Engineer

Thumbnail issues being resolved

Last Monday, our Solaris server that contains all image thumbnails developed problems. It ran out of memory, became too slow and eventually even started to crash. (For the technically inclined: we think the kernel is leaking some file system structure in kernel memory.) This caused missing thumbnails across Wikimedia projects.

We addressed these problems in the following ways:

  • We decreased the load on this server by adapting the Squid configuration, so it would have to handle fewer requests.
  • We ordered more memory, in order to double the total physical memory in the relevant systems.
  • We set up two new Linux servers that will eventually replace the Solaris server.

At first, adding these Linux servers in a partially caching setup seemed enough to fix the immediate problem while we gradually copied all thumbnail files over, which would eventually allow us to replace the Solaris server completely.

However, on Saturday night the Solaris server started crashing repeatedly, making it necessary to engage the image scalers to regenerate a large part of the missing thumbnails. This is slowing down the loading and generation of new (uncached) thumbnails.

Fortunately, most users have not experienced serious problems while using the site, since most thumbnails are cached by our HTTP caching layer. It is impossible to determine exactly how long it will take to recover completely from the slower service, but we expect that this will take no more than a few days.

Over the past months we have been developing a new and more scalable architecture for media storage, which will solve these problems once and for all. We hope to deploy this new architecture within a few months, also utilizing the new data center. Please watch the Tech Blog for updates on this project.

Site fixes this week

We’re still in the middle of cleaning up some lingering issues from the 1.17 deployment, and despite our best efforts, you may see a little bit of quirkiness in the site:

  • Since the deployment, a problem with our job queue meant that emails that were supposed to be sent from the site weren’t.  This backlog was cleared last night, and a lot of pent-up email was sent.
  • There were some HTML cache invalidations that caused parts of the site to get overloaded for a few minutes.
  • Yesterday, we started the deployment of the category sorting improvements.  We deployed some modifications to the database today.  This resulted in a few hiccups on the site that we’ve since mostly recovered from.

Category collation

One key set of improvements in the MediaWiki 1.17 release is the category sorting work spearheaded by Aryeh Gregor. This code will eventually improve the sorting of categories in different languages, allowing us to choose the most appropriate sort order for each language. For now, we’re at least switching over to a more sensible sorting algorithm, the Unicode Collation Algorithm (UCA), and have made other improvements to sorting.
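
To give a feel for why the choice of sorting algorithm matters, the sketch below contrasts plain code point ordering with a crude accent- and case-insensitive key. Both are assumptions chosen for illustration: the post does not describe the previous sort order in detail, and the rough key is only an approximation built from Python's standard library, not an implementation of the UCA.

```python
import unicodedata


def rough_sort_key(title: str) -> str:
    """Crude stand-in for a collation key: strip accents and ignore case.

    This is NOT the Unicode Collation Algorithm, only an illustration.
    """
    decomposed = unicodedata.normalize("NFKD", title)
    without_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return without_marks.casefold()


# "\u00c9clair" is "Éclair" with a precomposed capital E acute.
titles = ["Zebra", "apple", "\u00c9clair"]

print(sorted(titles))                      # ['Zebra', 'apple', 'Éclair']
print(sorted(titles, key=rough_sort_key))  # ['apple', 'Éclair', 'Zebra']
```

Code point order puts every uppercase ASCII letter before every lowercase one and pushes accented letters to the end, which is rarely what a reader browsing a category expects.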

This set of changes required a modification of the database that we didn’t consider risky, but that was irreversible. Given how complicated the initial 1.17 deployment was, we decided to hold back on deploying this work.

There are still some maintenance scripts left to run before this work is fully deployed, but most parts of this are done.

Other fixes

We’re also aware of other problems with the job queue. We’re investigating them and hope to have them fixed soon.

Wikimedia selects Watchmouse for global monitoring services

Earlier today we announced our selection of Watchmouse website monitoring to assist both the Foundation and anyone around the world in keeping an eye on our server uptime and status.  With Watchmouse’s help, the Foundation now has a public status page, which is maintained offsite on servers independent from Wikimedia, that reports our uptime and accessibility levels from over 50 locations around the world. The service breaks out each of the primary server systems of the Foundation, because it definitely takes more than one computer to keep us up and running.

This is the first time Wikimedia has offered a publicly visible, externally hosted website monitoring service. Uptime is of course critical for reaching all of Wikimedia’s users, but also for ensuring that our wikis are open and editable to everyone, all the time.

With a rapidly growing, and global, audience of hundreds of millions of readers and contributors, Wikimedia’s properties have become an integral part of how the world accesses and shares knowledge.  This new service is particularly important as the Foundation establishes its permanent data center infrastructure, and looks beyond the US and Europe to establish more data centers (more regular updates from our engineering team can be found on the Wikimedia tech blog). Publicly sharing where downtime (and uptime, of course) is being experienced also helps us maintain our mission focus on transparency and accessibility.

Thanks for joining us as mission supporters, Watchmouse!

Jay Walsh, Communications

Post Mortem on last night’s 1.17 deployment attempts…

We’ve received many complaints about strange behavior on various wikis we host starting last night. These problems were directly related to an attempted deployment.

A bit of background about the 1.17 release:

  • In Oct 2010 we committed to more frequent releases in response to community requests.
  • Simultaneously, we committed to cutting through the backlog of code review requests from the community. As of this writing, the Code Review Team we formed has reduced the backlog of over 1400 un-reviewed core revisions down to zero in the 1.17 branch, as well as dispatching roughly 4000 other revisions in extensions (figuring out which ones we needed to review, and reviewing the important revisions there, too).
  • 1.17 was an omnibus collection of fixes, including a large number of patches which had been waiting for review for a long time. The Foundation’s big contribution to the release was the ResourceLoader, a piece of MediaWiki infrastructure that allows for on-demand loading of JavaScript. Many other incremental improvements were made in how MediaWiki parses and caches pages and page fragments.

As is our usual practice, we review all code before trying to deploy it. This practice has generally been good enough in the past that we have been able to quickly address anything we don’t catch in review within the first few minutes of deployment. The 1.17 release process has been longer than we would have liked, which has meant more code to review, and a greater likelihood of accumulating a critical mass of problems that would cause us to abort a deployment.

Our preparation for deployment uncovered a few issues, including a schema change, an update to the latest version of the diff utility and various other small issues which were discovered during the initial deployment to test.wikipedia.org. Pushing to test.wikipedia.org turns out to have been hugely useful, and in future we will take it as a lesson learned that any large deployment must successfully deploy to test.wikipedia.org at least 24 hours prior to general deployment.

When we finally deployed last night, our Apaches started complaining pretty much immediately. We rolled back to the previous version, worked on debugging and thought we had a suitable fix. We attempted deployment again but found the same issue very quickly. What we discovered was that our cache miss rate went from roughly 22% with the old version of the software (1.16) to about 45% with 1.17. The higher miss rate increased the load on our Apaches to the point where they couldn’t keep up, at which point they started behaving unpredictably. This can cause cascading failures (for example, caching bad data served by overloaded Apaches), and can result in strange layout problems and other issues that many people witnessed today.
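
A quick back-of-the-envelope calculation shows why this change in miss rate was so damaging. The sketch assumes that every cache miss turns into a request the Apaches must handle and that incoming traffic stays the same; the requests-per-second figure is hypothetical and only there to make the numbers concrete.

```python
old_miss_rate = 0.22  # roughly, with 1.16
new_miss_rate = 0.45  # roughly, with 1.17

# For the same incoming traffic, backend load scales with the miss rate.
load_factor = new_miss_rate / old_miss_rate
print(f"Backend load grows by about {load_factor:.2f}x")  # about 2.05x

# Hypothetical traffic figure, for illustration only.
requests_per_second = 10_000
print(round(requests_per_second * old_miss_rate))  # 2200 misses per second
print(round(requests_per_second * new_miss_rate))  # 4500 misses per second
```

In other words, going from 22% to 45% roughly doubles the work reaching the Apaches, even though the hit rate only drops from 78% to 55%.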

By the way, whenever we do a large deployment, a number of WMF staff and community developers meet online to work through any issues that might arise. We schedule deployments late at night in the US to take advantage of lulls in request traffic, so everybody is working late. By the second failure, these people had been awake for many hours and we started to be concerned about their ability to work efficiently on little sleep, so I vetoed further attempts at deployment today.

We are currently combing the logs for further clues about how to mitigate the risk of a similar outcome when we next attempt to deploy 1.17, which most likely won’t happen until later this week (at the earliest). We’re also closely investigating the check-ins related to parsing and caching, and evaluating our profiling data. We plan to regroup tomorrow, decide how confident we are in the fixes we have been able to implement over the preceding 24 hours, and decide when we should aim to deploy.

Planned deployment of 1.17 branch on February 8

The engineering team is busy working on the deployment of the 1.17 branch of MediaWiki.  We plan to roll this out next week to all languages and projects, Tuesday, February 8, with work starting at 07:00 UTC (which is 11pm on Monday, February 7 for San Francisco).
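
For readers who want to convert the start time to their own time zone, here is a minimal sketch of the conversion quoted above. It assumes San Francisco is on Pacific Standard Time (UTC-8) in February and uses a fixed offset rather than a time zone database.

```python
from datetime import datetime, timedelta, timezone

# Assumption: San Francisco is on Pacific Standard Time (UTC-8) in February.
PST = timezone(timedelta(hours=-8), "PST")

start_utc = datetime(2011, 2, 8, 7, 0, tzinfo=timezone.utc)
start_sf = start_utc.astimezone(PST)

print(start_sf.strftime("%A, %B %d, %I:%M %p %Z"))
# Monday, February 07, 11:00 PM PST
```

Swapping in a different fixed offset gives the local start time elsewhere.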

If all goes well, you should only notice the improvement. If it doesn’t go well, that’s because there’s something we missed, and that’s where we’d love your help.  Please help us test this release! We have a test instance of the software we plan to deploy available at prototype.wikimedia.org.  If you find issues, please report them in Bugzilla.

There are many, many little fixes and improvements that have gone into 1.17 (see the draft release notes for an exhaustive list).  There isn’t much that’s visible to users of the site, but there is one under-the-hood change that should result in some speed improvements: Resource Loader.  Resource Loader optimizes the use of JavaScript in MediaWiki, speeding up delivery of JavaScript by compressing it in some cases and cutting down on the amount of unused JavaScript that gets delivered to the browser in the first place.  Much of the work in this development cycle has centered on ensuring compatibility with the new system.  Since it makes such a large shift in the way that JavaScript is delivered to the browser, it’s also an operational aspect we’ll be keeping a close eye on as load shifts between servers in our infrastructure.

Note that this isn’t a release for download, yet.  On and after February 8, the “latest” version of MediaWiki will still be 1.16 as listed on mediawiki.org.  We plan to update this to 1.17 sometime after the deployment of the 1.17 branch, after we’ve had time to run it in production for a while and fix the issues we’re likely to find.

So please, help us test this release, and if you find bugs, please report them in Bugzilla.  Thanks!

11-15-10 Outage

Today at 20:00 UTC we saw a traffic surge on our load balancing and caching infrastructure, resulting in intermittent outages in Wikipedia service worldwide. This was due to a complex interaction of factors, including issues in our Amsterdam caching center and the Fundraiser launch, which has generated much more interest than expected today. We switched all traffic to Tampa, which then experienced service problems due to the high traffic and additional load. Service is now fully recovered worldwide, and we are continuing to closely monitor all systems.

Danese Cooper
CTO, Wikimedia Foundation