Wikimedia blog

News from the Wikimedia Foundation and about the Wikimedia movement

Posts by Mark Bergsma

Thumbnail issues being resolved

Last Monday, the Solaris server that stores all image thumbnails developed problems. It ran out of memory, became too slow and eventually started to crash. (For the technically inclined: we think the kernel is leaking some file system structure in kernel memory.) This caused missing thumbnails across Wikimedia projects.

We addressed these problems in the following ways:

  • We decreased the load on this server by adapting the Squid configuration, so it would have to handle fewer requests.
  • We ordered more memory to double the total physical memory in the relevant systems.
  • We set up two new Linux servers that will eventually replace the Solaris server.

At first, adding these Linux servers in a partially caching setup seemed sufficient to fix the immediate problem while we gradually copied all thumbnail files over, which would allow us to replace the Solaris server completely.

However, on Saturday night the Solaris server started crashing repeatedly, making it necessary to engage the image scalers to regenerate a large part of the missing thumbnails. As a result, new (uncached) thumbnails are currently slow to generate and load.

Fortunately, most users have not experienced serious problems while using the site, since most thumbnails are cached by our HTTP caching layer. It is impossible to determine exactly how long it will take to recover completely from the slower service, but we expect that this will take no more than a few days.
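The request path described in this post (HTTP caching layer first, then thumbnail storage, then the image scalers for anything missing) can be sketched roughly as follows. This is a minimal Python illustration under stated assumptions, not the actual Wikimedia code: the function name `fetch_thumbnail`, the dictionary-based cache and store, and the `scale` callback are all hypothetical stand-ins.

```python
from typing import Callable, Dict, Tuple


def fetch_thumbnail(
    key: str,
    http_cache: Dict[str, bytes],
    thumb_store: Dict[str, bytes],
    scale: Callable[[str], bytes],
) -> Tuple[bytes, str]:
    """Return (thumbnail bytes, where it came from).

    All names here are hypothetical; the dictionaries stand in for the
    HTTP caching layer and the thumbnail storage server.
    """
    # 1. Most requests stop here: the caching layer already has the thumbnail.
    if key in http_cache:
        return http_cache[key], "cache"

    # 2. Cache miss: ask the thumbnail storage.
    if key in thumb_store:
        data = thumb_store[key]
        source = "store"
    else:
        # 3. Missing from storage: regenerate it (the slow path) and save it.
        data = scale(key)
        thumb_store[key] = data
        source = "scaler"

    # Populate the cache so the next request for this key is fast.
    http_cache[key] = data
    return data, source
```

In this sketch, only requests that miss both the cache and the store reach the scaler, which is why losing stored thumbnails mainly slows down uncached images.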

Over the past months we have been developing a new and more scalable architecture for media storage, which will solve these problems once and for all. We hope to deploy this new architecture within a few months, also utilizing the new data center. Please watch the Tech Blog for updates on this project.

Wikimedia projects down due to power problem in primary data center

Starting at 0:10 UTC on July 5th, the Wikimedia Foundation suffered from intermittent, partial power failures in the internal power network of one of its main data centers in Tampa, Florida. Due to the temporary unavailability of several critical systems and the large impact on the available systems capacity, all Wikimedia projects went down. The power situation stabilized at 1:12 UTC, and systems and services recovery has been taking place since. We expect all projects to be back online and editable around 4:00 UTC.

Global Outage (cooling failure and DNS)

Due to an overheating problem in our European data center, many of our servers shut down to protect themselves. Because this affected access to Wikipedia and all other projects for European users, we were forced to move all user traffic to our Florida cluster. We have a standard quick failover procedure in place for this, which changes our DNS entries.

However, shortly after we performed this failover, it turned out that the failover mechanism was broken, causing DNS resolution of Wikimedia sites to stop working globally. This problem was quickly resolved, but unfortunately it may take up to an hour before access is restored for everyone, due to DNS caching.

We apologize for the inconvenience this has caused.

Update: Unfortunately, for many users this outage seems to have lasted longer than an hour. It appears that many ISPs’ DNS resolvers do not honor the Negative Cache TTL that we send (1 hour), and instead use a longer value. We have worked around this problem by renaming the affected DNS record.
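To illustrate why renaming the record helps, here is a small Python sketch of negative caching. It assumes the behavior specified in RFC 2308, where the TTL of a negative answer is the smaller of the SOA record's own TTL and its MINIMUM field. The `NegativeCache` class, its `floor` parameter (modeling a resolver that enforces its own longer minimum), and the example names are hypothetical and only serve the illustration.

```python
from typing import Dict


def negative_ttl(soa_ttl: int, soa_minimum: int) -> int:
    """Negative-answer TTL per RFC 2308: min(SOA TTL, SOA MINIMUM)."""
    return min(soa_ttl, soa_minimum)


class NegativeCache:
    """Toy model of a resolver's negative cache (hypothetical, for illustration).

    `floor` models a resolver that ignores shorter TTLs and keeps
    negative answers for at least `floor` seconds.
    """

    def __init__(self, floor: int = 0) -> None:
        self.floor = floor
        self._expiry: Dict[str, int] = {}

    def store_negative(self, name: str, soa_ttl: int, soa_minimum: int, now: int) -> None:
        # A well-behaved resolver uses the advertised TTL; one with a
        # floor stretches it to its own longer value.
        ttl = max(negative_ttl(soa_ttl, soa_minimum), self.floor)
        self._expiry[name] = now + ttl

    def is_negatively_cached(self, name: str, now: int) -> bool:
        expiry = self._expiry.get(name)
        return expiry is not None and now < expiry
```

In this model, a resolver with `floor=0` forgets the failed lookup after the advertised hour, while one with a larger floor keeps failing for the old name. A lookup for a different name has no negative cache entry at all, so it goes straight to the authoritative servers, which matches the workaround described in the update.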

Update 21:32 UTC: Our SSL gateway, secure.wikimedia.org, was disabled due to overload issues, but is now back up.