Posts Tagged ‘downtime’
Thanks to Priyanka’s wonderful work theming Bugzilla and ironing out the last couple of bugs and extension issues, we are finally ready to move ahead with the upgrade. As a side effect, Bugzilla will be down for a couple of hours (let’s say two, to be on the safe side) around lunchtime. (Edit: 2010-01-19, between 20:00 GMT and 22:00 GMT)
Our primary router for the pmtpa cluster had to be rebooted today at 12:00 GMT. A line card had died and needed replacing, and the
system required a reboot for the replacement to fully take effect. Once that finished, CentralNotice was adding a lot of overhead and had to be disabled so that our caching cluster could catch up. The extra load then caused the primary database master for S3 to overload, and we are in the process of switching database masters to another server.
If all had gone as planned, this would have been a quick five-minute router reboot and we would have been back online. Unfortunately, things do not always go smoothly, so what should have been five minutes has stretched on for a while. This post will be updated as more details are resolved.
Update: We have switched database masters successfully and all sites and projects should once again be fully functional as of 14:13 GMT.
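Switching database masters, as mentioned above, roughly amounts to repointing replication at a new primary. Here is a minimal sketch using the standard MySQL client; the host names and log coordinates are hypothetical placeholders, and the actual Wikimedia procedure was certainly more involved:

```shell
# On the new master candidate (hypothetical host db-new): stop replicating
# from the old master and note the current binary log coordinates.
mysql -h db-new -e "STOP SLAVE;"
mysql -h db-new -e "SHOW MASTER STATUS;"

# On each remaining replica: repoint replication at the new master, using
# the coordinates reported above (file name and position are placeholders).
mysql -h db-replica1 -e "STOP SLAVE;
  CHANGE MASTER TO
    MASTER_HOST='db-new',
    MASTER_LOG_FILE='mysql-bin.000123',
    MASTER_LOG_POS=4;
  START SLAVE;"
```

In practice the application configuration also has to be updated to send writes to the new master, and replicas must be caught up before the switch to avoid losing transactions.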
Rob Halsell, Operations Engineer
Uploading and generation of new thumbs will be temporarily disabled on Wikimedia sites while we patch & reboot the server to fix the performance issues we’ve been seeing.
We hope to be done within a couple hours (by 22:00 UTC or so — 3pm PDT), but it could run shorter or longer.
Rough procedure for the curious:
- Take image thumbnailing servers offline
- Disable uploads
- Unmount file server from web servers
- Patch & reboot file server: rebooted – 21:00 UTC
- Remount file server on web servers – 21:09
- Put image thumbnailing servers back online – 21:12
- Re-enable uploads – done 21:18!
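The unmount/patch/remount steps above can be sketched as shell commands. The mount point and host names here are illustrative assumptions, not the actual Wikimedia configuration:

```shell
# On each web server: unmount the NFS-served upload volume before the
# file server reboots (/mnt/upload and nfs-server are hypothetical names).
umount /mnt/upload

# On the file server: apply the kernel patch (details depend on the OS
# and distribution), then reboot.
reboot

# On each web server, once the file server is back up: remount the volume.
mount -t nfs nfs-server:/export/upload /mnt/upload
```

Unmounting first matters because processes blocked on a dead NFS mount can hang indefinitely; taking the thumbnailers offline and disabling uploads beforehand keeps traffic from piling up against the missing mount.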
With the kernel fix, the file server should now behave better. We’ll then be able to continue our more leisurely migration of thumbnail files to another server, freeing up disk space on the primary box.
Updated 20:20 UTC: Added our hoped-for ETA
Update 20:44 UTC: A side-effect of taking the image server offline broke account creation and some edits that trigger the anti-bot captcha. We have switched to the simple captcha mode, which doesn’t use images, for now.
Update 20:56 UTC: Just noting that this affects <math> and <timeline> rendering as well. You may see some math rendering errors until we’ve completed; sorry!
Update 21:12 UTC: File server is back online and uploads are re-enabled. So far so good!
Brion Vibber, Lead Software Architect
Wikimedia’s PDF export service is temporarily down; the server failed to reboot after a routine kernel upgrade. It should be resolved or replaced with a spare box within a couple hours…
Update: Server is back online.
Our PDF export server is presently down. It had to be rebooted so that we could organize and re-route some power cables in our racks, and when it powered back on it failed to load all of its software correctly. We are working on resolving the problem; I just wanted to post something here on the blog, since it is the first place many people check when they think a service is broken.
We’ve been seeing some general slowdowns in our image and media file serving recently, including some instances in the last couple days where the sites as a whole have been affected to the point of extreme slowness or temporary inaccessibility.
Domas believes this is related to this reported problem with NFS performance when ZFS snapshots are active. We’ve had some luck so far with it improving after dropping older snapshots (possibly along with restarting NFS and temporarily disabling the image scaler servers to give it a little breathing room to reset).
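Dropping older snapshots, as described above, can be done with standard ZFS tooling. A minimal sketch, assuming a filesystem named tank/upload (a hypothetical name) on a Solaris-style host:

```shell
# List snapshots of the filesystem, oldest first, with their space usage.
zfs list -t snapshot -o name,used,creation -s creation tank/upload

# Destroy a specific old snapshot to free space and shorten the snapshot
# chain (the snapshot name is a placeholder).
zfs destroy tank/upload@2009-05-01

# Optionally restart the NFS service afterwards (Solaris SMF syntax).
svcadm restart svc:/network/nfs/server
```

Each retained snapshot pins every block it references, so pruning the oldest ones both reclaims disk space and reduces the bookkeeping the filesystem does on writes.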
We’ve been planning for some time to redo the way we access our media files internally, which should help reduce the impact on the rest of the site when load problems occur on the file servers. We might also be able to spread the load among multiple servers to improve things even more.
Updates will come as we get things back on track…
Update 2009-07-15: We’re temporarily shutting off uploads while we apply the ZFS fix patch and reboot the main file server. You may see some missing images or funky error messages for a little bit, but the sites should otherwise continue working normally until the file server is back up.
Update 2: Server is patched and uploads are back online. This should resolve our performance problems while we continue rearranging the upload servers to be more future-proof.
Brion Vibber, Lead Software Architect
I am sure that many folks noticed that on the morning of 2009-06-26, techblog.wikimedia.org and blog.wikimedia.org went down. It turns out that parts of our WordPress installations were compromised. I do not want to get into a direct show-and-tell of what the attackers did, but we have hopefully hardened the installation to the point that it will not happen again.
This is why the blogs live on their own server: when things like this happen, we can minimize the impact. The blogs are both up and running now, along with the other services that were affected. Everything but techblog was back online before Friday was over; techblog lagged behind until today. (As techblog was the point of exploit, we got everything else back up first.) Other affected services were the Open Conference Systems site for Wikimania 2009 and our survey software; both were back online as soon as possible after the incident, and the rest followed.
Of course, it was hard to get this information out while the blogs were down! It goes to show how much we have come to rely on the blogs for getting information out; without them, we had to scramble to spread the word through other channels.
Thanks to everyone who assisted in the restoration, and also thanks to everyone for their patience while the system was fixed.
CSW2-knams is down, and with it a few servers: pascal, ragweed, clematis, iris, fuchsia and a couple of the sql-text*.knams boxes.
It seems this issue mostly affects the toolserver environment.
I am still working on figuring out a way of fixing this and will update once the issue has been resolved.
Sorry for the inconvenience.
Update: Mark was able to resolve the issue. Apparently, excess temperature caused by an HVAC malfunction at the datacenter made the servers shut down automatically.
While we were working on the servers, some Apache config files were rendered inoperable. This happened on a miscellaneous-services server named Singer, which hosts our blogs as well as some other web-facing information. As such, the cached blogs are affected, though no longer the tech blog (it was at first, but it was the easiest to get back online).
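Broken Apache configuration files can usually be caught before a restart with Apache’s built-in syntax check. A small illustrative sketch (the exact binary name, apachectl versus apache2ctl, depends on the distribution):

```shell
# Validate the configuration without touching the running server.
apachectl configtest

# If the syntax check passes, reload gracefully so that existing
# connections are not dropped.
apachectl graceful
```

Running configtest as part of any config-editing routine means a typo surfaces immediately, rather than taking the sites down on the next restart.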
Apologies for any annoyance this single server downtime may have caused anyone. Rest assured, it will be fixed and steps will be taken to prevent it from occurring in the future.