Wikimedia blog

News from inside the Wikimedia Foundation.org

Outage

Downtime on en.wikipedia.org resolved

We had 52 minutes of downtime on the English-language Wikipedia site today; only en.wikipedia.org was affected. Our master database server was thrown into a funky state in which hundreds of access threads were stuck in the “statistics” state — which seems to be MySQL’s way of saying “I’ve fallen and I can’t get up”.

It’s unclear exactly what set it off, but basically nothing works until you restart MySQL. After switching the site to an alternate master database, all has been well.
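For anyone who wants to spot this kind of pile-up sooner, here is a minimal sketch of the check, not the script we actually run. It assumes you already have the rows of a `SHOW PROCESSLIST` result as a list of dicts keyed by column name. The function names and the threshold of 100 threads are hypothetical, chosen only for illustration.

```python
from collections import Counter


def count_threads_by_state(processlist_rows):
    """Tally threads by their State column (lower-cased; NULL becomes '')."""
    return Counter((row.get("State") or "").lower() for row in processlist_rows)


def looks_wedged(processlist_rows, state="statistics", threshold=100):
    """Return True if at least `threshold` threads are sitting in `state`.

    The threshold is an assumed value for illustration, not a tuned one.
    """
    counts = count_threads_by_state(processlist_rows)
    return counts[state.lower()] >= threshold


if __name__ == "__main__":
    # Fabricated sample rows, shaped like SHOW PROCESSLIST output.
    sample = [{"Id": i, "Command": "Query", "State": "statistics"} for i in range(150)]
    sample.append({"Id": 999, "Command": "Sleep", "State": None})
    print(looks_wedged(sample))
```

A check like this could feed an alert so the master switch gets started without waiting for someone to read the process list by hand.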

At 52 minutes from the start of the event, this took us a bit longer to resolve than I’d like: we had to percolate through a couple of levels of alert calls before we finished diagnosing it and getting the DB switch pushed through. (Sorry to wake you up early, Tim!)

A similar event in future should be fixable within a few minutes, thanks to Tim’s work on making the master-switch system more foolproof. We’re fixing up our internal documentation so all our site ops will know how to run the database master switch script next time!


– brion

European network outage

We’re encountering some networking problems between our Tampa and Amsterdam data centers, which is breaking access to the sites for people in Europe. Mark’s poking to see if it can be resolved; if necessary we’ll reroute European visitors directly to the Tampa center.

Update: This has been resolved.

csw2-knams seems to have gone down

csw2-knams is down, and with it a few servers: pascal, ragweed, clematis, iris, fuchsia and a couple of sql-text*.knams.
It seems this issue mostly affects the toolserver environment.

I am still working on figuring out a way of fixing this and will update once the issue has been resolved.
Sorry for the inconvenience.

Update: Mark was able to resolve the issue. Apparently, the excess temperature caused by the HVAC malfunction at the datacenter made the servers shut down automatically.

Server named Singer has a sore throat?

While we were working on the servers, some Apache config files were made inoperable. This happened on a miscellaneous-services machine named Singer, which hosts our blogs as well as some other web-facing info. As such, the cached blogs are affected, but not the tech blog. (It was, but it was the easiest to get back online.)

Apologies for any annoyance this single-server downtime may have caused. Rest assured, it will be fixed and steps will be taken to prevent it from occurring in the future.

English Wikipedia brief outage

We had a crash on our database master for English Wikipedia. Domas is restarting it and swapping it out for another master server; should be back online in a few minutes.

In the meantime, Wikipedia in other languages and all other Wikimedia sites remain unaffected.


Update 23:39 UTC: We’re back! Looks like approximately 25 minutes of breakage.

An out-of-memory condition on the database master server ended up killing the MySQL daemon…