Wikimedia sites experienced an outage today that started at about 6:15am PDT (13:15 UTC). All sites except the mobile site were back up by 7:18am PDT (14:18 UTC). Mobile site services resumed at about 8:35am PDT (15:35 UTC).
At about 6:15am PDT, we were alerted to a site issue, and our team found that network connectivity between our two data centers had been severed. When we checked with our network provider, they told us the outage was caused by a fiber cut between the two data centers.
Our two data centers, one in Ashburn, Virginia and the other in Tampa, Florida, are connected by two separate fiber links for redundancy. Although Ashburn serves most of the traffic, it depends on the Tampa data center for backend services (e.g., databases).
Those two separate fiber links each run at 10 Gbps. We are now working with our network provider to determine how and why a single fiber cut affected us when our network is supposed to be redundant. We are still waiting for their full report.
The team worked around the outage by rerouting traffic to Tampa, bypassing the Ashburn site. Connectivity was restored on one of the provider's network links at about 8:35am PDT (15:35 UTC), and the second link was restored at about 11:30am PDT (18:30 UTC). However, we will not move traffic back to Ashburn until we are comfortable with the provider's fix. The switch from Tampa back to Ashburn should not be noticeable to users.
UPDATE: Expanded report posted here: http://wikitech.wikimedia.org/view/Site_issue_Aug_6_2012
Please see status.wikimedia.org for site availability.