Wikimedia blog

News from inside the Wikimedia Foundation.org

Outage

Site fixes this week

We’re still in the middle of cleaning up some lingering issues from the 1.17 deployment, and despite our best efforts, you may see a little bit of quirkiness in the site:
  • Since the deployment, a problem with our job queue meant that emails that were supposed to be sent from the site weren’t.  The backlog was cleared last night, and a lot of pent-up email was sent.
  • There were some HTML cache invalidations that caused parts of the site to get overloaded for a few minutes.
  • Yesterday, we started the deployment of the category sorting improvements.  We deployed some modifications to the database today.  This resulted in a few hiccups on the site that we’ve since mostly recovered from.
Category collation

One key set of improvements in the MediaWiki 1.17 release is the category sorting work spearheaded by Aryeh Gregor. This code will eventually improve the sorting of categories in different languages, allowing us to choose the most appropriate sort order for each language. For now, we’re at least switching over to a more sensible sorting algorithm, the Unicode Collation Algorithm (UCA), and have made other improvements to sorting.
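To illustrate why plain code point ordering is a poor default for category listings, here is a minimal Python sketch. It is not the MediaWiki implementation and not a real UCA implementation: `rough_collation_key` is a hypothetical helper that only approximates the idea by ignoring accents and case, and the sample titles are made up.

```python
import unicodedata


def rough_collation_key(title):
    """Crude stand-in for a collation sort key (hypothetical, not real UCA)."""
    # Decompose accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", title)
    # Drop the combining marks so that "É" compares like "E".
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Compare case-insensitively first; the original title breaks ties.
    return (base.casefold(), title)


# Made-up page titles for illustration only.
titles = ["Zebra", "apple", "Émile", "emu", "Banana"]

# Code point order: all capitals first, accented letters last.
by_code_point = sorted(titles)

# Approximate "dictionary" order, closer to what readers expect.
by_rough_key = sorted(titles, key=rough_collation_key)

print(by_code_point)
print(by_rough_key)
```

A full UCA implementation also handles language-specific tailoring and multi-level comparison, which this sketch deliberately leaves out.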

This set of changes required a modification of the database that we didn’t believe was risky, but was irreversible. Given how complicated the initial 1.17 deployment was, we decided to hold back on deploying this work.

There are still some maintenance scripts left to run before this work is fully deployed, but most of it is done.

Other fixes

We’re also aware of other problems with the job queue. We’re investigating them and hope to have them fixed soon.

Post Mortem on last night’s 1.17 deployment attempts…

We’ve received many complaints about strange behavior on various wikis we host starting last night. These problems were directly related to an attempted deployment.

A bit of background about the 1.17 release:

  • In Oct 2010 we committed to more frequent releases in response to community requests.
  • Simultaneously, we committed to cutting through the backlog of code review requests from the community. As of this writing, the Code Review Team we formed has reduced the backlog of over 1400 un-reviewed core revisions down to zero in the 1.17 branch, as well as dispatching roughly 4000 other revisions in extensions (figuring out which ones we needed to review, and reviewing the important revisions there, too).
  • 1.17 was an omnibus collection of fixes, including a large number of patches which had been waiting for review for a long time. The Foundation’s big contribution to the release was the ResourceLoader, a piece of MediaWiki infrastructure that allows for on-demand loading of JavaScript. Many other incremental improvements were made in how MediaWiki parses and caches pages and page fragments.

As is our usual practice, we review all code before trying to deploy it. This practice has generally been good enough in the past that we have been able to quickly address anything we don’t catch in review within the first few minutes of deployment. The 1.17 release process has been longer than we would have liked, which has meant more code to review, and a greater likelihood of accumulating a critical mass of problems that would cause us to abort a deployment.

Our preparation for deployment uncovered a few issues, including a schema change, an update to the latest version of the diff utility and various other small issues which were discovered during the initial deployment to test.wikipedia.org. Pushing to test.wikipedia.org turns out to have been hugely useful, and in future we will take it as a lesson learned that any large deployment must successfully deploy to test.wikipedia.org at least 24 hours prior to general deployment.

When we finally deployed last night, our Apaches started complaining pretty much immediately. We rolled back to the previous version, worked on debugging and thought we had a suitable fix. We attempted deployment again but found the same issue very quickly. What we discovered was that our cache miss rate went from roughly 22% with the old version of the software (1.16) to about 45% with 1.17. The higher miss rate increased the load on our Apaches to the point where they couldn’t keep up, at which point they start behaving unpredictably. This can cause cascading failures (for example, caching bad data served by overloaded Apaches), and can result in strange layout problems and other issues that many people witnessed today.
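As a rough worked example of what that change in miss rate means for the Apaches, the sketch below does the arithmetic in Python. It assumes total request volume stays constant and that every cache miss is served by an Apache; the figure of 10,000 requests per second is hypothetical and only there to make the numbers concrete.

```python
def backend_requests(total_requests, miss_rate):
    """Requests that fall through the cache and reach the backend."""
    return total_requests * miss_rate


def backend_load_factor(old_miss_rate, new_miss_rate):
    """How many times more backend traffic the new miss rate produces."""
    return new_miss_rate / old_miss_rate


OLD_MISS_RATE = 0.22  # roughly 22% with 1.16 (from the post)
NEW_MISS_RATE = 0.45  # about 45% with 1.17 (from the post)
TOTAL_REQUESTS = 10_000  # hypothetical requests per second

old_load = backend_requests(TOTAL_REQUESTS, OLD_MISS_RATE)
new_load = backend_requests(TOTAL_REQUESTS, NEW_MISS_RATE)
factor = backend_load_factor(OLD_MISS_RATE, NEW_MISS_RATE)

print(f"backend load: {old_load:.0f} -> {new_load:.0f} req/s ({factor:.2f}x)")
```

Under these assumptions the backend sees roughly twice as many requests, even though the miss rate itself rose by only about 23 percentage points.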

By the way, whenever we do a large deployment, a number of WMF staff and community developers meet online to work through any issues that might arise. We schedule deployments late at night in the US to take advantage of lulls in request traffic, so everybody is working late. By the second failure, these people had been awake for many hours and we started to be concerned about their ability to work efficiently on little sleep, so I vetoed further attempts at deployment today.

We are currently combing the logs for further clues about how to mitigate the risk of a similar outcome when we next attempt to deploy 1.17, which most likely won’t happen until later this week (at the earliest). We are also closely investigating the check-ins related to parsing and caching, and evaluating our profiling data. We plan to regroup tomorrow, decide how confident we are in the fixes we have been able to implement in the past 24 hours, and decide when we should aim to deploy.

11-15-10 Outage

Today at 20:00 UTC we saw a traffic surge on our load balancing and caching infrastructure, resulting in intermittent outages in Wikipedia service worldwide. This was due to a complex interaction of factors, including issues in our Amsterdam caching center and the Fundraiser launch, which has generated much more than expected interest today. We switched all traffic to Tampa, which experienced service problems due to high traffic and the additional load. Currently service is fully recovered worldwide, and we are continuing to closely monitor all systems.

Danese Cooper
CTO, Wikimedia Foundation

10/10/10 Outage

Around 18:00 UTC today, all Wikimedia projects experienced an unplanned outage caused by a cascade of events originating with the Image Scalers and eventually spreading through our web servers and load balancers due to an apparent bug in PyBal code. The situation was remedied by restarting key servers and rebalancing the load between subsystems. Full service availability was restored at 19:30 UTC.

Database errors on most Wikipedias

At 10:57 UTC, the master database server for s3 (the cluster that holds most of our wikis) had a full disk and stopped writing. For this reason it was no longer possible to edit these wikis. The larger wikis live on separate clusters and were not affected.

After switching to another master database, all wikis are back up and editable as of 12:02 UTC. A few edits that were made during the incident may have been lost.

Wikimedia projects down due to power problem in primary data center

Starting at 0:10 UTC on July 5th, the Wikimedia Foundation suffered from intermittent, partial power failures in the internal power network of one of its main data centers in Tampa, Florida. Due to the temporary unavailability of several critical systems and the large impact on the available systems capacity, all Wikimedia projects went down. The power situation stabilized at 1:12 UTC, and systems and services recovery has been taking place since. We expect all projects to be back online and editable around 4:00 UTC.

Global Outage (cooling failure and DNS)

Due to an overheating problem in our European data center, many of our servers turned off to protect themselves. As this affected access to Wikipedia and all other projects for European users, we were forced to move all user traffic to our Florida cluster, for which we have a standard quick failover procedure in place that changes our DNS entries.

However, shortly after we did this failover switch, it turned out that this failover mechanism was now broken, causing the DNS resolution of Wikimedia sites to stop working globally. This problem was quickly resolved, but unfortunately it may take up to an hour before access is restored for everyone, due to caching effects.

We apologize for the inconvenience this has caused.

Update: Unfortunately, for many, this outage seems to have lasted longer than an hour. It appears that many ISPs’ DNS resolvers do not honor the so-called Negative Cache TTL that we send (1 hour), and instead use a longer value. We have circumvented this problem by renaming the affected DNS record to something else.
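The toy model below sketches why renaming the record works around resolvers that hold on to negative answers for too long. It relies on RFC 2308, under which the negative caching TTL is the smaller of the SOA record's own TTL and its MINIMUM field. Everything else is an assumption made for illustration: the `NegativeCache` class, the three-hour resolver floor, the SOA TTL of one day, and the `example.org` names are hypothetical, not our actual records or any specific ISP's behavior.

```python
def negative_cache_ttl(soa_record_ttl, soa_minimum):
    """Negative caching TTL per RFC 2308: min of SOA TTL and SOA MINIMUM."""
    return min(soa_record_ttl, soa_minimum)


class NegativeCache:
    """Toy model of a resolver's cache of failed lookups, keyed by name."""

    def __init__(self, min_ttl_floor=0):
        # Hypothetical misbehavior: the resolver never caches for less
        # than this many seconds, whatever the zone advertises.
        self.min_ttl_floor = min_ttl_floor
        self._expiry = {}

    def store_failure(self, name, advertised_ttl, now):
        ttl = max(advertised_ttl, self.min_ttl_floor)
        self._expiry[name] = now + ttl

    def is_blocked(self, name, now):
        """True while a cached failure still hides the (now fixed) record."""
        expiry = self._expiry.get(name)
        return expiry is not None and now < expiry


# The zone advertises a one-hour negative TTL (assumed SOA TTL: one day).
advertised = negative_cache_ttl(soa_record_ttl=86400, soa_minimum=3600)

# A resolver that insists on caching failures for at least three hours.
cache = NegativeCache(min_ttl_floor=3 * 3600)
cache.store_failure("old-record.example.org", advertised, now=0)

# Two hours later the old name is still blocked, but a new name is not.
old_name_blocked = cache.is_blocked("old-record.example.org", now=7200)
new_name_blocked = cache.is_blocked("new-record.example.org", now=7200)
print(advertised, old_name_blocked, new_name_blocked)
```

Because the cached failure is tied to the name that failed, a record published under a different name is looked up fresh and resolves immediately.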

Update 21:32 UTC: Our SSL gateway, secure.wikimedia.org, was disabled due to overload issues, but is now back up.

ESAMS Servers not reachable, some EU traffic affected. (Fixed!)

Starting approx. 03:20 GMT, servers in our ESAMS facility began to roll offline one after another.  After some investigation, it appears power is not being supplied to all the servers.  This has resulted in some slowdowns for traffic of EU users.

We have temporarily migrated all traffic to our primary FL datacenter.  Once the servers are back online in ESAMS, we will be pushing service back to it as well.

Update: The problem has been identified and finally fixed. Traffic has been returned to normal.

The best guess so far is that there was a cooling failure in the datacenter which caused the Sun boxes to shut themselves down.

An update from Leaseweb/Evoswitch is here:  http://noc.leaseweb.com/status.php?i=389

PDF Export currently down (fixed)

Our PDF export server is presently down.  It had to be rebooted so that we could organize and route some power cables in our racks.  When it powered back on, it failed to load all software correctly.  We are working on resolving it; I just wanted to post something here on the blog since it is the first place that many people check when they think some service is broken.

Power outage in Wikimedia’s European servers

This seems to be a power outage at our European proxy caching cluster; we’ll see if we can give more details later.

European traffic has been rerouted to our US servers, but the extra load may cause the sites to be a little sluggish for now. (If your DNS is still seeing the old entries, you can manually configure your browser to use the US proxy: rr.pmtpa.wikimedia.org port 80. You should only do this temporarily, as you won’t be able to access anything *but* Wikipedia and our sister projects. :)
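For scripts rather than browsers, the same temporary workaround could look something like the Python sketch below. It uses the standard library's `urllib.request.ProxyHandler` with the proxy host and port given above; `build_proxy_opener` is a hypothetical helper name, and the example request is left commented out because the proxy is only meant for use during this outage.

```python
import urllib.request

# Proxy host and port as given in the post; only valid during the outage.
US_PROXY = "http://rr.pmtpa.wikimedia.org:80"


def build_proxy_opener(proxy_url=US_PROXY):
    """Build an opener that sends plain HTTP requests through the proxy."""
    handler = urllib.request.ProxyHandler({"http": proxy_url})
    opener = urllib.request.build_opener(handler)
    return opener, handler


opener, handler = build_proxy_opener()

# Example use (not executed here):
# page = opener.open("http://en.wikipedia.org/wiki/Main_Page", timeout=10)
print(handler.proxies)
```

Using a dedicated opener, rather than installing the proxy globally, keeps the workaround easy to remove once DNS has caught up.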

Update 21:13 UTC:

European servers are coming back online; we should have this cleaned up pretty soon.

Update 21:26 UTC:

We’re starting to switch traffic back to Europe. Should be better in a few minutes… In the meantime, amuse yourself reading the Twitter panic. :)

Update 21:40 UTC:

You can also use the SSL interface to Wikipedia, which doesn’t have the proxy overload.