Wikimedia site outage, 6 August, 2012

Wikimedia sites experienced an outage today that started at about 6:15am PDT (13:15 UTC). Except for the mobile site, the sites were brought back up by 7:18am PDT (14:18 UTC). Mobile site services resumed at about 8:35am PDT (15:35 UTC).

At about 6:15am PDT, we were alerted to a site issue and our team found that network connectivity between our two data centers had been severed. When we checked with our network provider, they informed us that the outage was caused by a fiber cut between the two data centers.

The data centers — one in Ashburn, Virginia and the other in Tampa, Florida — are connected by two separate fiber links (for redundancy). While Ashburn serves most of the traffic, it needs to talk to our Tampa data center for backend services (e.g. database).

We do operate two separate 10-gigabit fiber links between the data centers. We are now working with our network provider to determine how and why a single fiber cut affected us when our network is supposed to be redundant. We are still waiting for their full report.

The team worked around the outage by rerouting traffic to Tampa, bypassing the Ashburn site. Connectivity on one of the provider’s network links was restored at about 8:35am PDT (15:35 UTC). The second link was restored at about 11:30am PDT (18:30 UTC). However, we will not move traffic back to Ashburn until we are comfortable with the provider’s fix. The switch back to Ashburn from Tampa should not be apparent to users.
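For readers who want the timeline as elapsed time, the following is a minimal sketch that computes how long after the start of the outage each milestone above occurred. It uses only the approximate UTC times stated in this post. The helper names (utc, minutes_since_start) and the event labels are illustrative and not part of any Wikimedia tooling.

```python
from datetime import datetime, timezone


def utc(hour, minute):
    """Build a UTC timestamp on 6 August 2012 (the day of the outage)."""
    return datetime(2012, 8, 6, hour, minute, tzinfo=timezone.utc)


# Approximate times taken from the post (all "about", so treat as rough).
OUTAGE_START = utc(13, 15)
EVENTS = {
    "main sites restored": utc(14, 18),
    "mobile site and first link restored": utc(15, 35),
    "second link restored": utc(18, 30),
}


def minutes_since_start(when):
    """Whole minutes elapsed between the outage start and the given time."""
    return int((when - OUTAGE_START).total_seconds() // 60)


durations = {label: minutes_since_start(t) for label, t in EVENTS.items()}

for label, minutes in durations.items():
    print(f"{label}: {minutes} min after outage start")
```

By this arithmetic, the main sites were unavailable for roughly an hour and the mobile site for a little over two hours. Full redundancy between the data centers was back a little over five hours after the cut.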

UPDATE: Expanded report posted here: http://wikitech.wikimedia.org/view/Site_issue_Aug_6_2012

Please see status.wikimedia.org for site availability.

 CT Woo, Director of Technical Operations
Categories: Operations, Outage

3 Comments on Wikimedia site outage, 6 August, 2012

Casey Brown 2 years

@jolison (comment #2): We all agree that any outage is unacceptable. As of December 2011, though, the average uptime was 99.97%, so it seems the sysadmins are actually doing a great job. We can try as hard as we can to minimize downtime (all sites do that), but unfortunately sometimes there are circumstances beyond the system administrators’ control. As CT said, the Foundation is paying to have redundant cables that would prevent errors like this, so the amount of money allocated to tech isn’t at issue here. It seems the issue was with the third-party company that manages the datacenter, an issue that we wouldn’t have known about without this happening. Hopefully issues like this can be avoided in the future.

jolison 2 years

With a budget of over 20 million dollars coming from the fundraiser, this is pretty much unacceptable. 90% of the money should go to technology aspects like this. Outage is not an option.

CT Woo 2 years

An update with the root cause is now available: http://wikitech.wikimedia.org/view/Site_issue_Aug_6_2012

CT Woo
