Wikimedia failover test—expected impact for editors

Photo by Arild Vågen, CC BY-SA 3.0.

During the month of April, all Wikimedia wikis will be placed into read-only mode for two short periods—an action that will allow the Wikimedia Foundation’s Technology department to test services in a new secondary data center in Texas (referred to as “codfw”).
The new data center is a replica of our cluster in Virginia. The main purpose of this data center is to improve the reliability and failover capabilities of Wikipedia and all of our sites for users around the world. It maintains a full, up-to-date copy of the databases for Wikipedia and other projects, plus many other services. In case of any type of disaster at the main data center in Virginia, the Operations team expects to be able to transfer all traffic to the secondary data center in Texas within minutes.
Upcoming test for the new data center
Major pieces of our infrastructure have been successfully deployed or tested there with actual live traffic, but until now, the heart of our sites has been missing: MediaWiki itself. This is changing. After working with teams both inside and outside the department, the Technology department is now ready to perform a failover test, during which we will transfer all application server traffic and tightly coupled service dependencies to the new data center for a minimum of 48 hours. At the end of the test, we will transfer it all back again.
This failover process is scheduled to happen during the week of 18 April, with the actual switchovers beginning on Tuesday, 19 April at 14:00 UTC and Thursday, 21 April at 14:00 UTC. Any changes to this schedule will be noted on our Wikitech calendar.
Effect of this test on editors and other contributors to our sites
Ideally we’d make this switchover without impact to our users, but limitations in MediaWiki prevent that at this time. At the start and end of this test, we will have to place all wikis in read-only mode for a short time. We expect this step to take approximately 15 to 30 minutes each time.
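For bot operators and others who schedule automated edits, the expected read-only windows can be worked out from the schedule above. The following is a minimal sketch, not an official tool: the function names are hypothetical, the year 2016 is assumed from the date of this post, and each window is assumed to last the full 30-minute upper estimate. The actual read-only periods may be shorter or may shift, so treat the result only as a planning aid.

```python
from datetime import datetime, timedelta, timezone

# Switchover start times from the announcement, in UTC.
# Assumption: the year (2016) comes from the date of this post.
SWITCHOVER_STARTS = [
    datetime(2016, 4, 19, 14, 0, tzinfo=timezone.utc),
    datetime(2016, 4, 21, 14, 0, tzinfo=timezone.utc),
]

# Assumption: use the upper end of the "15 to 30 minutes" estimate.
MAX_READ_ONLY = timedelta(minutes=30)


def read_only_windows(starts=SWITCHOVER_STARTS, max_duration=MAX_READ_ONLY):
    """Return (start, end) pairs for each expected read-only period."""
    return [(start, start + max_duration) for start in starts]


def in_expected_window(moment, windows=None):
    """Return True if a timezone-aware datetime falls in an expected window."""
    if windows is None:
        windows = read_only_windows()
    return any(start <= moment < end for start, end in windows)


def to_local(moment, utc_offset_hours):
    """Convert a UTC datetime to a fixed-offset local time for display."""
    return moment.astimezone(timezone(timedelta(hours=utc_offset_hours)))


if __name__ == "__main__":
    for start, end in read_only_windows():
        # Example: show each window in UTC+2 as well as UTC.
        print(start.isoformat(), "to", end.isoformat(),
              "| local start:", to_local(start, 2).isoformat())
```

A script that saves edits could call `in_expected_window(datetime.now(timezone.utc))` and postpone its work when the result is true. This checks only the announced schedule; it does not query the wikis themselves.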
During the week of 18 April, we will be halting all non-essential code deployments. This means that the regular MediaWiki deployment process will be stopped, and no other non-critical deployments will be done that week.
The switchover process is quite involved today, but this test will give us information that we can use to make it simpler, faster, and more secure in the future. We hope not only to greatly reduce the disruption for our users and the time needed to make the switch, but also to reduce the amount of manual effort necessary. We appreciate your patience while we improve this essential infrastructure that helps us to keep useful information from the projects available on the Internet, free of charge, in perpetuity.
Mark Bergsma, Director of Technical Operations and Lead Operations Architect
Wikimedia Foundation

Archive notice: This is an archived post from blog.wikimedia.org, which operated under different editorial and content guidelines than Diff.

4 Comments

So, we have to fail the system to test the failover system? This is one step that many failover systems never take, and then they fail, for good.

I also agree that it’s preferable to perform a real full-scale switch at a scheduled time when enough teams are available to monitor what happens and see how the failover mechanism can be improved later. However, I’m not sure that the selected schedule was the least disruptive for users, notably for this first test, which could have been run at a less critical time. But a true failover event could happen at any time in the future, when far fewer people will be available to make the switch. This scheduled test is then necessary to see how the process…

[…] You can read about a previous similar and successful failover test in a blog post from April 2016. […]