As some of you may have noticed, yesterday our engineering team discovered that 16 of our Gerrit repositories were very badly broken. Their branches and tags all seemed to have vanished, along with their configuration (which is stored in a special branch in the repository itself). All of the repositories except one have been restored to their state as of about midnight UTC on Thursday, September 6. What follows is an in-depth analysis of what happened and how I fixed it, along with some commentary about what I learned along the way.
Wikimedia sites experienced an outage today that started at about 6:15am PDT (13:15 UTC). Except for the mobile site, the sites were brought back up by 7:18am PDT (14:18 UTC). Mobile site services resumed at about 8:35 am PDT (15:35 UTC).
At about 6:15am PDT, we were alerted to a site issue, and our team found that network connectivity between our two data centers had been severed. When we checked with our network provider, they informed us that the outage was caused by a fiber cut between the two data centers.
The data centers — one in Ashburn, Virginia and the other in Tampa, Florida — are connected by two separate fiber links (for redundancy). While Ashburn serves most of the traffic, it needs to talk to our Tampa data center for backend services (e.g. database).
We do operate two separate 10 Gbps fiber links between the data centers. We are now working with our network provider to determine how and why we were impacted by that fiber cut when we are supposed to have redundancy in our network. We are still waiting for their full report.
The team worked around the outage by rerouting traffic to Tampa, bypassing the Ashburn site. Connectivity on one of the provider's network links was restored at about 8:35am PDT. The second link was restored at about 11:30am PDT (18:30 UTC). However, we will not revert traffic back to Ashburn until we are comfortable with their fix. The switch back to Ashburn from Tampa should not be apparent to users.
UPDATE: Expanded report posted here: http://wikitech.wikimedia.org/view/Site_issue_Aug_6_2012
Please see status.wikimedia.org for site availability.
At midnight UTC on July 1, Wikimedia’s search cluster stopped working. A “leap second” inserted by the NTP daemon at that time caused Java processes to lock up, including our Lucene search system. The same bug affected many other websites. Our engineers restored service in less than two hours.
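To see why leap seconds are awkward for software, note that POSIX time and most language runtimes simply have no representation for the inserted second 23:59:60, so handling it falls to the kernel and NTP layer. Here is a minimal Python sketch illustrating the gap (purely illustrative, unrelated to our Java/Lucene stack):

```python
from datetime import datetime, timezone

# Illustrative only: the leap second inserted at the end of 2012-06-30
# ("23:59:60 UTC") cannot be represented by most software clocks, which is
# why its handling is pushed down to the kernel/NTP layer, where the bug
# that froze Java processes lived.
try:
    datetime(2012, 6, 30, 23, 59, 60, tzinfo=timezone.utc)
except ValueError as err:
    print("No room for a leap second:", err)  # "second must be in 0..59"
```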
Leap seconds are added to our clocks once every few years so that the sun will be directly over the Royal Observatory in Greenwich at precisely 12:00. Some people believe that the desire to keep these two time standards synchronised is anachronistic, and that it would be better to let them drift apart for 600 years and then add a single “leap hour”. I’m sure many computer engineers would breathe a sigh of relief if such a change were implemented.
Tim Starling, Lead Platform Architect
- One problem since the deployment was with our job queue, which meant that emails that were supposed to be sent from the site weren’t. This backlog was cleared last night, and a lot of pent-up email was sent.
- There were some HTML cache invalidations that caused parts of the site to get overloaded for a few minutes.
- Yesterday, we started the deployment of the category sorting improvements. We deployed some modifications to the database today. This resulted in a few hiccups on the site that we’ve since mostly recovered from.
One key set of improvements in the MediaWiki 1.17 release is the category sorting work spearheaded by Aryeh Gregor. This code will eventually improve the sorting of categories in different languages, allowing us to choose the most appropriate sort order for each language. For now, we’re at least switching over to a more sensible sorting algorithm, the Unicode Collation Algorithm (UCA), and have made other improvements to sorting.
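As a rough illustration of why collation matters, here is a minimal sketch comparing naive code-point ordering with locale-aware (UCA-like) collation. This is Python using the standard `locale` module, not MediaWiki's actual PHP/ICU implementation; the titles and the locale name are assumptions, and the locale must be installed on the machine running it.

```python
import locale

# Hypothetical category member titles; MediaWiki computes real sort keys
# server-side with a UCA implementation, not with this module.
titles = ["Örebro", "Zebra", "apple", "Éclair", "Banana"]

# Naive code-point ordering: uppercase sorts before lowercase, and accented
# letters sort after "Z", which is rarely what readers of a category expect.
print(sorted(titles))

# Locale-aware collation, closer to what UCA-based sorting produces.
# "en_US.UTF-8" is an assumed locale and must be available on the system.
locale.setlocale(locale.LC_COLLATE, "en_US.UTF-8")
print(sorted(titles, key=locale.strxfrm))
```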
This set of changes required a modification of the database that we didn’t believe was risky, but was irreversible. Given how complicated the initial 1.17 deployment was, we decided to hold back on deploying this work.
There are still some maintenance scripts left to run before this work is fully deployed, but most of it is done.
We’ve received many complaints about strange behavior on various wikis we host starting last night. These problems were directly related to an attempted deployment.
A bit of background about the 1.17 release:
- In Oct 2010 we committed to more frequent releases in response to community requests.
- Simultaneously, we committed to cutting through the backlog of code review requests from the community. As of this writing, the Code Review Team we formed has reduced a backlog of over 1400 unreviewed core revisions down to zero in the 1.17 branch, and has dispatched roughly 4000 other revisions in extensions (figuring out which ones we needed to review, and reviewing the important ones, too).
As is our usual practice, we review all code before trying to deploy it. This practice has generally been good enough in the past that we have been able to quickly address anything we don’t catch in review within the first few minutes of deployment. The 1.17 release process has been longer than we would have liked, which has meant more code to review and a greater likelihood of accumulating a critical mass of problems that would force us to abort a deployment.
Our preparation for deployment uncovered a few issues, including a schema change, an update to the latest version of the diff utility and various other small issues which were discovered during the initial deployment to test.wikipedia.org. Pushing to test.wikipedia.org turns out to have been hugely useful, and in future we will take it as a lesson learned that any large deployment must successfully deploy to test.wikipedia.org at least 24 hours prior to general deployment.
When we finally deployed last night, our Apaches started complaining pretty much immediately. We rolled back to the previous version, worked on debugging and thought we had a suitable fix. We attempted deployment again but found the same issue very quickly. What we discovered was that our cache miss rate went from roughly 22% with the old version of the software (1.16) to about 45% with 1.17. The higher miss rate increased the load on our Apaches to the point where they couldn’t keep up, at which point they started behaving unpredictably. This can cause cascading failures (for example, caching bad data served by overloaded Apaches), and can result in strange layout problems and other issues that many people witnessed today.
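As a back-of-the-envelope check of why that hurts so much: every cache miss becomes a request the Apaches must render themselves, so roughly doubling the miss rate roughly doubles the backend load. The sketch below uses the two miss rates quoted above; the front-end request rate is an assumed figure, for illustration only.

```python
# Rough arithmetic, not measured data beyond the two miss rates quoted above.
requests_per_second = 50_000      # assumed front-end request rate (illustration only)

old_miss_rate = 0.22              # with MediaWiki 1.16
new_miss_rate = 0.45              # observed with 1.17

old_backend_load = requests_per_second * old_miss_rate   # 11,000 req/s hit the Apaches
new_backend_load = requests_per_second * new_miss_rate   # 22,500 req/s hit the Apaches

print(f"Backend load grew by {new_backend_load / old_backend_load:.2f}x")  # ~2.05x
```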
By the way, whenever we do a large deployment, a number of WMF staff and community developers meet online to work through any issues that might arise. We schedule deployments late at night in the US to take advantage of lulls in request traffic, so everybody is working late. By the second failure, these people had been awake for many hours and we started to be concerned about their ability to work efficiently on little sleep, so I vetoed further attempts at deployment today.
We are currently combing the logs for further clues about how to mitigate the risk of a similar outcome when we next attempt to deploy 1.17, which most likely won’t happen until later this week (at the earliest). We are also closely investigating the check-ins related to parsing and caching, and evaluating our profiling data. We plan to regroup tomorrow, decide how confident we are in the fixes we were able to implement over the past 24 hours, and make a decision as to when we should next attempt to deploy.
Today at 20:00 UTC we saw a traffic surge on our load balancing and caching infrastructure, resulting in intermittent outages in Wikipedia service worldwide. This was due to a complex interaction of factors, including issues in our Amsterdam caching center and the Fundraiser launch, which generated much more interest than expected today. We switched all traffic to Tampa, which experienced service problems due to the additional load. Service is currently fully recovered worldwide, and we are continuing to closely monitor all systems.
CTO, Wikimedia Foundation
Around 18:00 UTC today, all Wikimedia projects experienced an unplanned outage caused by a cascade of events originating with the Image Scalers and eventually spreading through our web servers and load balancers due to an apparent bug in PyBal code. The situation was remedied by restarting key servers and rebalancing the load between subsystems. Full service availability was restored at 19:30 UTC.
At 10:57 UTC, the master database server for s3 (the cluster that holds most of our wikis) had a full disk and stopped writing. For this reason it was no longer possible to edit these wikis. The larger wikis live on separate clusters and were not affected.
After switching to another master database, all wikis are back up and editable as of 12:02 UTC. A few edits that were made during the incident may have been lost.
Starting at 0:10 UTC on July 5th, the Wikimedia Foundation suffered from intermittent, partial power failures in the internal power network of one of its main data centers in Tampa, Florida. Due to the temporary unavailability of several critical systems and the large impact on available system capacity, all Wikimedia projects went down. The power situation stabilized at 1:12 UTC, and systems and services recovery has been taking place since. We expect all projects to be back online and editable around 4:00 UTC.
Due to an overheating problem in our European data center, many of our servers shut down to protect themselves. As this affected access to Wikipedia and our other projects for European users, we were forced to move all user traffic to our Florida cluster, for which we have a standard quick failover procedure in place that changes our DNS entries.
However, shortly after we made this switch, it turned out that the failover mechanism itself was broken, causing DNS resolution of Wikimedia sites to stop working globally. This problem was quickly resolved, but unfortunately it may take up to an hour before access is restored for everyone, due to caching effects.
We apologize for the inconvenience this has caused.
Update: Unfortunately, for many, this outage seems to have lasted longer than an hour. It appears that many ISPs’ DNS resolvers do not honor the so-called Negative Cache TTL that we send (1 hour), and instead use a longer value. We have circumvented this problem by renaming the affected DNS record to something else.
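For the curious, the negative-cache TTL a resolver is supposed to honor comes from the zone's SOA record (per RFC 2308). A minimal sketch for inspecting it, using the third-party dnspython library (an assumption, not part of our tooling), might look like this:

```python
import dns.resolver  # third-party "dnspython" package, assumed installed

# Per RFC 2308 the negative-cache TTL is min(SOA record TTL, SOA "minimum"
# field); resolvers should not cache NXDOMAIN answers longer than this,
# although, as noted above, some ISP resolvers ignore it.
answer = dns.resolver.resolve("wikimedia.org", "SOA")
soa = answer[0]
print("SOA record TTL:", answer.rrset.ttl)
print("SOA minimum (negative-cache) field:", soa.minimum)
```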
Update 21:32 UTC: Our SSL gateway, secure.wikimedia.org, was disabled due to overload issues, but is now back up.