PMTPA Router Reboot – Scheduled Downtime (Resolved)

Our primary router for the pmtpa cluster had to be rebooted today at 12:00 GMT. A line card had died and needed replacing, and the system required a reboot for the replacement to fully take effect. Once the reboot finished, CentralNotice was adding a lot of overhead and had to be disabled so that our caching cluster could catch up. The resulting load then overloaded the primary database master for S3, and we are in the process of switching database masters to another server.
Had all gone as planned, this would have been a quick 5-minute router reboot before coming back online. Unfortunately, things do not always go smoothly, so what should have taken 5 minutes has taken a while. This post will be updated as more details are resolved.
Update: We have switched database masters successfully and all sites and projects should once again be fully functional as of 14:13 GMT.
Rob Halsell, Operations Engineer

Archive notice: This is an archived post from blog.wikimedia.org, which operated under different editorial and content guidelines than Diff.

1 Comment

As tweeted today:
Thanx to all Wikipedia system administrators (and developers) on System Administrator Appreciation Day today.
best wishes from Germany, Achim