Wikimedia blog

News from the Wikimedia Foundation and about the Wikimedia movement

Outage

ESAMS Servers not reachable, some EU traffic affected. (Fixed!)

Starting at approximately 03:20 GMT, servers in our ESAMS facility began to roll offline one after another. After some investigation, it appears power is not being supplied to all the servers. This has resulted in some slowdowns for EU users.

We have temporarily migrated all traffic to our primary FL datacenter. Once the servers are back online in ESAMS, we will be pushing service back to it as well.

Update: The problem has been identified and finally fixed. Traffic has been returned to normal.

The best guess so far is that there was a cooling failure in the datacenter which caused the Sun boxes to shut themselves down.

An update from Leaseweb/Evoswitch is here:  http://noc.leaseweb.com/status.php?i=389

Rob Halsell, Operations Engineer

PDF Export currently down (fixed)

Our PDF export server is presently down. It had to be rebooted so we could organize and route some power cables in our racks. When it powered back on, it failed to load all of its software correctly. We are working on resolving it; I just wanted to post something here on the blog, since it is the first place that many people check when they think some service is broken.

Power outage in Wikimedia’s European servers

This seems to be a power outage at our European proxy caching cluster; we’ll see if we can give more details later.

European traffic has been rerouted to our US servers, but the extra load may cause the sites to be a little sluggish for now. (If your DNS is still seeing the old entries, you can manually configure your browser to use the US proxy: rr.pmtpa.wikimedia.org port 80. You should only do this temporarily, as you won’t be able to access anything *but* Wikipedia and our sister projects. :)
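For anyone scripting the same workaround rather than changing browser settings, here is a minimal sketch in Python. The proxy host and port are the ones quoted above; the function names are made up for illustration, and building the opener does not send any traffic by itself.

```python
import urllib.request

# Host and port as quoted in the post; treat them as historical values.
US_PROXY = "rr.pmtpa.wikimedia.org:80"


def proxy_settings(proxy=US_PROXY):
    """Split 'host:port' into the two fields a browser proxy dialog asks for."""
    host, port = proxy.rsplit(":", 1)
    return {"host": host, "port": int(port)}


def build_proxy_opener(proxy=US_PROXY):
    """Return a urllib opener that routes plain HTTP requests via the proxy.

    Hypothetical helper: nothing is requested until .open() is called.
    """
    handler = urllib.request.ProxyHandler({"http": f"http://{proxy}"})
    return urllib.request.build_opener(handler)
```

As with the browser setting, this is only meant as a temporary measure, since the proxy will not serve anything other than Wikimedia sites.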

Update 21:13 UTC:

European servers are coming back online; we should have this cleaned up pretty soon.

Update 21:26 UTC:

We’re starting to switch traffic back to Europe. Should be better in a few minutes… In the meantime, amuse yourself reading the Twitter panic. :)

Update 21:40 UTC:

You can also use the SSL interface to Wikipedia, which doesn’t have the proxy overload.

Brion Vibber, Lead Software Architect

Downtime on en.wikipedia.org resolved

We had 52 minutes of downtime on the English-language Wikipedia site today; only en.wikipedia.org was affected. Our master database server was thrown into a funky state in which hundreds of access threads were stuck in the “statistics” state — which seems to be MySQL’s way of saying “I’ve fallen and I can’t get up”.

It’s unclear exactly what set it off, but basically nothing works until you restart MySQL. After switching the site to an alternate master database, all has been well.
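To illustrate the symptom, here is a rough sketch of how one might spot this condition from the output of SHOW PROCESSLIST. It assumes the rows have already been fetched as dictionaries keyed by column name (the "State" column is standard); the function names and the threshold of 100 threads are assumptions for the example, not what our monitoring actually does.

```python
from collections import Counter


def count_thread_states(rows):
    """Tally threads by the 'State' column of SHOW PROCESSLIST rows.

    A NULL state (None) is counted under the empty string.
    """
    return Counter((row.get("State") or "") for row in rows)


def looks_stuck(rows, state="statistics", threshold=100):
    """Return True when at least `threshold` threads share the given state.

    The threshold is an arbitrary illustrative value.
    """
    return count_thread_states(rows)[state] >= threshold
```

A check like this only flags the pile-up; it says nothing about what triggered it.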

At 52 minutes from start of event, this took us a bit longer than I’d like to resolve — we had to percolate through a couple levels of alert calls before we finished diagnosing it and getting the DB switch pushed through. (Sorry to wake you up early Tim!)

A similar event in the future should be fixable within a few minutes, thanks to Tim’s work on making the master-switch system more foolproof. We’re fixing up our internal documentation so all our site ops will know how to run the database master switch script next time!

– brion

European network outage

We’re encountering some networking problems between our Tampa and Amsterdam data centers, which is breaking access to the sites for people in Europe. Mark’s poking to see if it can be resolved; if necessary we’ll reroute European visitors directly to the Tampa center.

Update: Has been resolved.

csw2-knams seems to have gone down

CSW2-knams is down and with it a few servers: pascal, ragweed, clematis, iris, fuchsia and a couple of sql-text*.knams.
It seems this issue mostly affects the toolserver environment.

I am still working on figuring out a way of fixing this and will update once the issue has been resolved.
Sorry for the inconvenience.

Update: Mark was able to resolve the issue. Apparently, the excess temperature caused by the HVAC malfunction at the datacenter made the servers shut down automatically.

Server named Singer has a sore throat?

While working on the servers, some Apache config files were made inoperable. This is on a misc. services computer named Singer, which hosts our blogs as well as some other web-facing info. As such, the cached blogs are affected, but not the tech blog. (It was, but it was the easiest to get back online.)

Apologies for any annoyance this single server downtime may have caused anyone.  Rest assured, it will be fixed and steps will be taken to prevent it from occurring in the future.

English Wikipedia brief outage

We had a crash on our database master for English Wikipedia. Domas is restarting it and swapping it out for another master server; should be back online in a few minutes.

In the meantime, Wikipedia in other languages and all other Wikimedia sites remain unaffected.

Update 23:39 UTC: We’re back! Looks like approximately 25 minutes of breakage.

An out-of-memory condition on the database master server ended up killing the MySQL daemon…