Wikimedia blog

News from the Wikimedia Foundation and about the Wikimedia movement

Wikimedia projects down due to power problem in primary data center

Starting at 0:10 UTC on July 5th, the Wikimedia Foundation suffered from
intermittent, partial power failures in the internal power network of
one of its main data centers in Tampa, Florida. Due to the temporary
unavailability of several critical systems and the large impact on the
available systems capacity, all Wikimedia projects went down. The power
situation stabilized at 1:12 UTC, and recovery of systems and services has
been under way since. We expect all projects to be back online and
editable around 4:00 UTC.

9 Responses to “Wikimedia projects down due to power problem in primary data center”

  1. camel cig says:

    Is there no stand-by power? Just curious.

  2. wieyoga says:

    James Salsman :
    (“…The server-room implications should be that redundancy will be” easier, I meant to include.)

    I’m surprised a professionally-managed datacentre like the one we use would have such problems.

  3. Michael Jux says:

    Having only ONE power supply for mission critical datacentres is generally NOT common.
    Starting with TIER 4 (Uptime’s Classification) you will have multiple power and cooling distribution parts (see: http://www.bitkom.org/files/documents/reliable_data_centers_guideline.pdf)
    – THAT’s STATE OF THE ART DATACENTRE DESIGN! #;-).

  4. James Salsman says:

    (“…The server-room implications should be that redundancy will be” easier, I meant to include.)

  5. I was so glad to see support for offline editing in the budget from last week. That will support the dual use of allowing for mirroring peer-to-peer wikis hosting copies of projects (maybe including image bundles, if we’re lucky.) The server-room implications should be that redundancy will be

    The trick to making offline and peer-to-peer editing run smoothly is third party edit conflict resolution, which is remarkably similar in many ways to the pending changes extensions which recently went live, and also shares characteristics with WP:3O.

    And also, this outage seems to have been addressed very well. Kudos to the techs!

  6. Mike.lifeguard says:

    Jon :
    Any part of the power system in a colo _could_ fail; generally it is redundant and doesn’t fail, but even in the best of cases – things can go wrong. Heck, it is possible that a wire in the wall from the batteries to the servers melted down and failed.

    Yes, the reason I asked was precisely to find out what went wrong, not imply that nothing should ever go wrong :)

    Jon :
    As for redundant power supplies (on the machines themselves), that only helps in the case of A) having 2 completely separate power sources (generally not done)

    I’m pretty sure the colo does use two separate electricity providers.

  7. levu says:

    Mike.lifeguard :
    Are there not UPSes on the servers? Redundant power supplies? I’m surprised a professionally-managed datacentre like the one we use would have such problems.

    And they should also have enough oil down there in Florida to run them *g*

  8. Jon says:

    Mike.lifeguard :
    Are there not UPSes on the servers? Redundant power supplies? I’m surprised a professionally-managed datacentre like the one we use would have such problems.

    As someone who’s had this exact same problem inside of a professional colo, let me explain:

    Unlike your home, colocation facilities do not have batteries on every computer (Google is rumored to do this now, but that is neither here nor there). These facilities have giant batteries that run the entire facility, though generally the batteries are only big enough to keep the facility online until the giant diesel generators come online. All of this, generally including multiple utility power lines, is run through a series of power distribution units.

    The batteries are wired in line, so when the power goes out at a colo they are already “running”. As soon as the input voltage drops, the PDU turns off utility power (spike protection), the diesel generator kicks on, then the PDU switches over to the generators. If this sounds easy… it is and it isn’t. Things can go wrong. In my personal experience, the colo I was in had a bank of batteries (inexplicably) die. That effectively browned out the datacenter.

    Any part of the power system in a colo _could_ fail; generally it is redundant and doesn’t fail, but even in the best of cases – things can go wrong. Heck, it is possible that a wire in the wall from the batteries to the servers melted down and failed.

    As for redundant power supplies (on the machines themselves), they only help in the case of A) having 2 completely separate power sources (generally not done) or, more commonly, B) one of the power supplies IN the SERVER failing. A lot of servers in data centers have “pig tail” power cables – 1 wall outlet feeding the 2 power supplies.

  9. Mike.lifeguard says:

    Are there not UPSes on the servers? Redundant power supplies? I’m surprised a professionally-managed datacentre like the one we use would have such problems.
