Wikimedia blog

News from the Wikimedia Foundation and about the Wikimedia movement

Operations

Making Wikimedia Sites faster

Running the fifth largest website in the world brings its own set of challenges. One particularly important issue is the time it takes to render a page in your browser. Nobody likes slow websites, and we know from research that even small delays lead visitors to leave the site. An ongoing concern of both the Operations and Platform teams is improving the reader experience by making Wikipedia and its sister projects as fast as possible. We ask ourselves questions like: can we make Wikipedia 20% faster for half the planet?

As you can imagine, the end-user experience varies greatly across our uniquely diverse and global readership. Hence, we need to conduct real user monitoring to truly understand how fast our projects are in real-life situations.

But how do we measure how fast a webpage loads? Last year, we started building instrumentation to collect anonymous timing data from real users, through a MediaWiki extension called NavigationTiming.[1]

There are many factors that determine how fast a page loads, but here we will focus on the effects of network latency on page speed. Latency is the time it takes for a packet to travel from the originating server to the client who made the request.
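
As a rough illustration of what latency means in practice (this is not how our instrumentation works), one crude way to estimate it from a client is to time a TCP handshake to the server. A minimal sketch, assuming outbound connectivity on port 443:

```python
import socket
import time

def tcp_rtt_ms(host, port=443, timeout=5.0):
    """Rough latency estimate: time the TCP handshake to the server."""
    start = time.monotonic()
    with socket.create_connection((host, port), timeout=timeout):
        pass  # connection established; we only care about the elapsed time
    return (time.monotonic() - start) * 1000.0

print(round(tcp_rtt_ms("en.wikipedia.org"), 1), "ms")
```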

ULSFO

Earlier this year, our new data center (ULSFO) went fully operational, serving content to Oceania, South-East Asia, and the west coast of North America[2]. The main benefit of this work is shaving up to 70–80 ms of round-trip time for some regions of Oceania, East Asia, the US and Canada, an area with 360 million Internet users and a total population of approximately one billion people.

We recently explained how we chose which areas to serve from the new data center; knowing the sites became faster for those users was not enough for us, though: we wanted to know how much faster.

Results

Before we talk about specific results, it is important to understand that faster network round-trip times do not necessarily translate directly into a faster experience for users. When network times are faster, resources are retrieved faster, but there are many other factors that influence page latency. This is perhaps best explained with an example: if a page needs four network round trips to render, and trips 2, 3 and 4 happen while the browser is still parsing the huge main document fetched by trip 1, only the first request benefits from the faster network; the subsequent ones run in parallel and are entirely hidden under the fetching of the first. In this scenario, the performance bottleneck is parsing the first resource, not network time.

With that in mind, we wanted to learn two things when we analyzed the data from the NavigationTiming extension: how much did our network times improve, and can users feel the effect of faster network times? That is, are pages perceived to be faster, and if so, by how much?

The data we harvest from the NavigationTiming extension is broken down by country. We therefore concentrated our analysis on countries in Asia for which we had sufficient data points; we also included the United States and Canada, but we were not able to extract data for just the western states and provinces. Data for the United States and Canada was analyzed at the country level, so the improvements in latency there appear “muffled”.

How much did our network times improve?

The short summary is that network times improved quite a bit: for half of all requests, the time to retrieve the main document dropped by up to 70 ms.

ULSFO Improvement of Network times on Wikimedia Sites

In the accompanying graph, the data center rollout is marked with a dashed line. The rollout was gradual, so the gains do not appear immediately, but they are very significant after a few days. The graph includes data for Japan, Korea and the whole South-East Asia region.[3]

We graphed responseStart minus connectStart, which represents the time spent in the network until the first byte arrives, minus the time spent on DNS lookups. For a more visual explanation, take a look at the Navigation Timing diagram. If a TCP connection is dropped, the time will include the setup of the new connection. All the data we use to measure network improvements is provided by the Navigation Timing API, and is thus not available on IE8 and below.
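
For illustration, here is a minimal sketch of how such a beacon could be reduced to the metric described above, assuming each NavigationTiming event arrives as a dictionary of millisecond offsets (the field names are assumptions, not the extension's actual schema):

```python
from statistics import median

def network_time_ms(event):
    """Time spent in the network until the first byte arrives, excluding
    DNS lookups: responseStart minus connectStart."""
    return event["responseStart"] - event["connectStart"]

events = [
    {"country": "JP", "connectStart": 120, "responseStart": 210},
    {"country": "JP", "connectStart": 100, "responseStart": 160},
]
print(median(network_time_ms(e) for e in events))  # 75.0
```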

User perceived latency

Did the improvement of network times have an impact that our users could see? Well, yes it did. More so for some users than others.

The gains in Japan and Indonesia were remarkable: page load times dropped by up to 300 ms at the 50th percentile (weekly). We saw smaller (but measurable) improvements of 40 ms in the US too. However, we were not able to measure the impact in Canada.

The dataset we used to measure these improvements is bigger than the one we had for network times. As we mentioned before, the Navigation Timing API is not present in old browsers, so we cannot measure, say, network improvement in IE7. In this case, however, we used a measure of our own creation, called mediaWikiLoadComplete, that tells us when a page is done loading. This measure is taken in all browsers at the point when the page is ready to interact with the user, so faster times do mean that the user experience was also faster. Now, how users perceive the improvement has a lot to do with how fast pages were to start with. If a page now takes 700 ms to render instead of one second, any user will be able to see the difference. However, a difference of 300 ms in a 4-second page rendering will go unnoticed by most.
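
As a toy illustration of the percentile figures quoted here (the numbers and the helper are made up for the example; this is not our reporting pipeline), a nearest-rank percentile over a week of mediaWikiLoadComplete samples might look like:

```python
import math

def percentile(values, p):
    """Nearest-rank percentile; p is in the range (0, 100]."""
    ordered = sorted(values)
    k = math.ceil(p / 100.0 * len(ordered)) - 1
    return ordered[max(k, 0)]

week_of_samples_ms = [612, 688, 701, 743, 802, 3950]
print(percentile(week_of_samples_ms, 50))  # 701
print(percentile(week_of_samples_ms, 95))  # 3950
```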

Reduction in latency

Want to know more?

Want to know all the details? A (very) detailed report of the performance impact of the ULSFO rollout is available.

Next steps

Improving speed is an ongoing concern, particularly as we roll out new features, and we want to make sure that page rendering remains fast. We are keeping our eyes open for new ways of reducing latency, for example by evaluating TCP Fast Open. TCP Fast Open lets data ride along with the TCP handshake, skipping an entire round trip: the client can include data in its initial SYN, and the server can start sending its response before the three-way handshake has finished.
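
As a rough sketch of what enabling it can look like on the server side (assuming a Linux kernel with TFO enabled and a Python build that exposes the relevant socket constants; this is not our production configuration):

```python
import socket

# Listening socket with TCP Fast Open enabled; the option value is the
# maximum queue of pending TFO connections the kernel will hold.
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
srv.setsockopt(socket.IPPROTO_TCP, socket.TCP_FASTOPEN, 16)
srv.bind(("0.0.0.0", 8080))
srv.listen(128)

# A cooperating client may then send its request data in the SYN itself,
# e.g. sock.sendto(payload, socket.MSG_FASTOPEN, (host, port)) on Linux.
```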

We are also getting closer to deploying HipHop. HipHop is a virtual machine that compiles PHP into bytecode and then, at runtime, into native machine instructions, the same strategy Java and C# use to achieve their speed advantages. We’re quite confident that this will result in big performance improvements on our sites as well.

We wish you speedy times!

Faidon Liambotis
Ori Livneh
Nuria Ruiz
Diederik van Liere

Notes

  1. The NavigationTiming extension is built on top of the HTML5 component of the same name, which exposes fine-grained measurements from the moment a user submits a request to load a page until the page has been fully loaded.
  2. Countries and provinces served by ULSFO include: Bangladesh, Bhutan, Hong Kong, Indonesia, Japan, Cambodia, Democratic People’s Republic of Korea, Republic of Korea, Myanmar, Mongolia, Macao, Malaysia, Philippines, Singapore, Thailand, Taiwan, Vietnam, US Pacific/West Coast states (Alaska, Arizona, California, Colorado, Hawaii, Idaho, Montana, New Mexico, Nevada, Oregon, Utah, Washington, Wyoming) and Canada’s western provinces and territories (Alberta, British Columbia, Northwest Territories, Yukon Territory).
  3. Countries include: Bangladesh, Bhutan, Hong Kong, Indonesia, Japan, Cambodia, Democratic People’s Republic of Korea, Republic of Korea, Myanmar, Mongolia, Macao, Malaysia, Philippines, Singapore, Thailand, Taiwan, Vietnam.

How RIPE Atlas Helped Wikipedia Users

This post by Emile Aben is cross-posted from RIPE Labs, a blog maintained by the Réseaux IP Européens Network Coordination Centre (RIPE NCC). In addition to being the Regional Internet Registry for Europe, the Middle East and parts of Central Asia, the RIPE NCC also operates RIPE Atlas, a global measurement network that collects data on Internet connectivity and reachability to assess the state of the Internet in real time. Wikimedia engineer Faidon Liambotis recently collaborated with the RIPE NCC on a project to measure the delivery of Wikimedia sites to users in Asia and elsewhere using our current infrastructure. Together, they identified ways to decrease latency and improve performance for users around the world. 

During RIPE 67, Faidon Liambotis (Principal Operations Engineer at the Wikimedia Foundation) and I got into a hallway conversation. Long story short: We figured we could do something with RIPE Atlas to decrease latency for users visiting Wikipedia and other Wikimedia sites.

At that time, Wikimedia had two locations active (Ashburn and Amsterdam) and was preparing a third (San Francisco) to better serve users in Oceania, South Asia, and the US/Canada west coast regions. We were wondering what effect this third location would have on network latency for users world-wide, and Wikimedia wanted to quantify that effect before turning the location up.

Wikimedia runs its own Content Delivery Network (CDN), mostly for privacy and cost reasons. Like most CDNs, it geographically balances traffic to its various points of presence (PoPs) using a technique called GeoDNS: based on the DNS request made on a user’s behalf by their DNS resolver, the user is directed to a specific data center according to their own or their resolver’s IP address. This requires the authoritative DNS servers for Wikimedia sites to know where best to direct each user. Wikimedia uses gdnsd for authoritative DNS to dynamically answer those queries based on a region-to-datacenter map.
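
A minimal sketch of the idea (illustrative only; this is not Wikimedia's gdnsd configuration, and the prefixes, regions and addresses below are made-up placeholders):

```python
import ipaddress

# Hypothetical prefix-to-region and region-to-datacenter maps.
PREFIX_TO_REGION = {
    ipaddress.ip_network("203.0.113.0/24"): "oceania",
    ipaddress.ip_network("198.51.100.0/24"): "europe",
}
REGION_TO_DC = {"oceania": "san-francisco", "europe": "amsterdam"}
DC_ADDRESS = {"san-francisco": "192.0.2.10",
              "amsterdam": "192.0.2.20",
              "ashburn": "192.0.2.30"}

def resolve(resolver_ip, default_dc="ashburn"):
    """Answer a DNS query with the address of the data center mapped to the
    region the resolver's IP falls in, falling back to a default."""
    ip = ipaddress.ip_address(resolver_ip)
    for prefix, region in PREFIX_TO_REGION.items():
        if ip in prefix:
            return DC_ADDRESS[REGION_TO_DC[region]]
    return DC_ADDRESS[default_dc]

print(resolve("203.0.113.7"))  # 192.0.2.10, the San Francisco PoP here
```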

Some call this ‘stupid DNS tricks’; others find it useful to decrease latency towards websites. Wikimedia is in the latter group, and we used RIPE Atlas to see how well this method performs.

One specific question we wanted answered is where to “split Asia” between the San Francisco and Amsterdam Wikimedia locations. Latency is obviously a function of physical distance, but also of the choice of upstream networks. As an example, these choices determine whether packets to “other side of the world” destinations tend to be routed clockwise or counter-clockwise.

We scheduled latency measurements from all RIPE Atlas probes towards the three Wikimedia locations we wanted to look at, and visualised which data center showed the lowest latency for each probe. You can see the results in Figure 1 below.

Figure 1: Screenshot of latency map. Probes are colored based on the datacenter that shows the lowest measured latency for this particular probe.

This latency map shows the locations of RIPE Atlas probes, coloured by what Wikimedia data center has the lowest latency measured from that probe:

  • Orange: the Amsterdam PoP has the lowest latency
  • Green: the Ashburn PoP has the lowest latency
  • Blue: the San Francisco PoP has the lowest latency.

Probes where the lowest latency is over 150ms have a red outline. An interactive version of this map is available here. Note that this is a prototype to show the potential of this approach, so it is a little rough around the edges.

Probes located in India clearly have lower latency towards Amsterdam. Probes in China, South Korea, the Philippines, Malaysia and Singapore showed lower latency towards San Francisco. For other locations in South-East Asia the situation was less clear, but that is also useful information to have, because it shows that directing users to either the Amsterdam or the San Francisco data center seems equally good (or bad). It is also interesting to note that all of Russia, including the two easternmost probes in Vladivostok, has the lowest latency towards Amsterdam. For the Vladivostok probes, Amsterdam and San Francisco are almost the same distance away, give or take 100 km. Nearby probes in China, South Korea and Japan have the lowest latency towards San Francisco.
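
The per-probe classification behind the map is conceptually simple; a minimal sketch (my own reconstruction for illustration, not the prototype code mentioned below) could look like this:

```python
from statistics import median

def classify(probe_rtts, threshold_ms=150):
    """probe_rtts maps probe id -> {datacenter: [RTT samples in ms]}.
    Returns probe id -> (best datacenter, its median RTT, over-threshold flag)."""
    result = {}
    for probe, per_dc in probe_rtts.items():
        medians = {dc: median(samples) for dc, samples in per_dc.items() if samples}
        best = min(medians, key=medians.get)
        result[probe] = (best, medians[best], medians[best] > threshold_ms)
    return result

sample = {"probe-1": {"amsterdam": [310, 305],
                      "ashburn": [220, 215],
                      "san-francisco": [140, 150]}}
print(classify(sample))  # {'probe-1': ('san-francisco', 145, False)}
```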

There is always the question of drawing conclusions from a low number of samples, and of how representative RIPE Atlas probe hosts are of a larger population. In these cases, though, having some data is better than no data, and if a region has a low number of probes, that can always be fixed by deploying more probes there. If you live in an underrepresented region, you can apply for a probe and help improve this.

With this measurement data to back it, Wikimedia has gradually turned up the countries in Oceania and South Asia and the US/Canada states to which RIPE Atlas measurements showed minimal latency, so that they are now served by its San Francisco caching PoP. The geo-config that Wikimedia is running is publicly available here.

As for the code that scheduled the measurements and created the latency map: this was all prototype-quality code at best, so I originally planned to find a second site where we could repeat the exercise, to see if we could generalise the scripts and visualisation before sharing them.

At RIPE 68 there was interest in even this raw prototype code for doing things with data centers, latency and RIPE Atlas, so we ended up sharing the code privately, and we have heard of progress made on it already. In the meantime we’ve put up the code that created the latency map on GitHub. Again: it’s a prototype, but if you can’t wait for a better version, please feel free to use and improve it.

Conclusion

If you have an interesting idea, but lack the time or something else is stopping you from implementing it, please let us know! You can always chat with us at a RIPE meeting, a regional meeting, or through any other channel. We don’t have infinite time, but we can definitely try out things, especially ideas that will improve the Internet and/or improve the life of network operators.

Emile Aben

Revamped Wikipedia app now available on Android

The Main Page of the English Wikipedia on the new Android app.

If you love Wikipedia and have an Android phone, you’re in for a treat! Today we’ve released a revamped Wikipedia for Android app, now available on Google Play.

Our new app is native from the ground up, making it the fastest way to experience Wikipedia on a phone. For the first release, we’ve focussed on creating a great browsing and reading experience. Whether you’re looking up a specific fact or looking to spend a day learning a new topic, our search and table of contents features get you to the information you need, quickly and intuitively. We’re also offering the ability to edit in the app, so you can help make Wikipedia better for billions of readers around the world.

What features are included?

  • Speed – Our new, native app allows you to browse and edit Wikipedia faster than ever before.
  • Editing – You can edit Wikipedia on the app. Logged in or logged out, we thank you for all your contributions.
  • Recent pages – We provide you with your reading history, so you can tap as many links as you like without ever getting lost.
  • Saved pages – You can save select pages for offline reading and browse them even when you don’t have a data connection.
  • Share – Use your existing social networking apps to share in the sum of all human knowledge.
  • Language support – The app allows you to seamlessly switch to reading Wikipedia written in any language.
  • Wikipedia Zero – We’ve partnered with cellular carriers around the world to provide Wikipedia free of data charges to users in many developing areas.

Coming soon

  • Night mode – We’ve gotten lots of great beta user feedback; one feature people love is reading Wikipedia in darker environments. The inverted colour scheme offered by night mode will make that much easier.
  • Discussions – Talk pages are an important part of Wikipedia for both new users and experienced editors alike. We’re bringing them to the app.

This release is just the beginning! We’re still working hard on creating new features to make the app the best Wikipedia reading and editing experience out there. Whether you’re a long-time user of Wikipedia on Android or are brand new to the app, give it a spin and let us know what you think. This is just the first step; we hope this app will grow with us, and we’re excited to have our community help us evolve it.

Please help us improve this app by sending a note to our mailing list, mobile-android-wikipedia@wikimedia.org, or writing a comment here.

Thank you!

Dan Garry, 
Associate Product Manager, Mobile Apps

Wikimedia Foundation selects CyrusOne in Dallas as new data center

Our new data center will be co-located at CyrusOne in Dallas/Carrollton. It will be able to handle the full load of Wikimedia’s global traffic in case of an emergency, and will handle partial load at all times.

In October 2013, we launched a public request for proposals regarding a new data center location in the continental US for co-locating Wikimedia infrastructure. We’ve now concluded this process and selected the CyrusOne facility in Dallas/Carrollton.

Wikimedia’s primary data center is in Ashburn, Virginia, and we’ve been preparing to end our remaining hosting presence in Tampa, Florida (which had historical reasons: Florida is where Wikipedia grew up).

We were looking for a facility that can handle the full load of Wikimedia’s global traffic in case of an emergency (and that will handle partial load at all times), and will now build out our presence in CyrusOne with that goal in mind.

Selecting a new data center is not a decision we take lightly, as these relationships tend to last many years and are very important for Wikimedia’s site reliability. As part of this process, we evaluated more than 39 bids, conducted 18 site visits, and arrived at a shortlist of 3 vendors.

CyrusOne’s Carrollton data center is a modern and large facility with a highly redundant and efficient power and cooling infrastructure.

The CyrusOne bid met our key requirements at a very competitive price. It is a modern and large facility with a highly redundant and efficient power and cooling infrastructure. In addition, the state of Texas maintains an independent power grid, which may be beneficial in case of major power issues affecting our Ashburn facility.

We look forward to working with CyrusOne in years to come as we continue to increase reliability and responsiveness of Wikimedia’s sites and services, to provide free knowledge to hundreds of millions of people around the world.

Mark Bergsma
Director of Technical Operations and Lead Operations Architect

Migrating Wikimedia Labs to a new Data Center

As part of ongoing efforts to reduce our reliance on our Tampa, Florida data center, we have just moved Wikimedia Labs to EQIAD, the new data center in Ashburn, Virginia. This migration was a multi-month project and involved hard work on the part of dozens of technical volunteers. In addition to reducing our reliance on the Tampa data center, this move should provide quite a few benefits to the users and admins of Wikimedia Labs and Tool Labs.

Migration objectives

We had several objectives for the move:

  1. Upgrade our virtualization infrastructure to use OpenStack Havana;
  2. Minimize project downtime during the move;
  3. Stop relying on nova-network and start using Neutron;
  4. Convert the Labs data storage system from GlusterFS to NFS;
  5. Identify abandoned and disused Labs resources.

Upgrade and Minimize Downtime

Wikimedia Labs uses OpenStack to manage the virtualization back-end. The Tampa Labs install was running a slightly old version of OpenStack, ‘Folsom’. Folsom is more than a year old now, but OpenStack does not provide an in-place upgrade path that doesn’t require considerable downtime, so we’ve been living with Folsom to avoid disrupting existing Labs services.

Similarly, a raw migration of Labs from one set of servers to another would have required extensive downtime, as simply copying all of the data would be the work of days.

The solution to both 1) and 2) was provided by OpenStack’s multi-region support. We built an up-to-date OpenStack install (version ‘havana’) in the Ashburn center and then modified our Labs web interface to access both centers at once. In order to ease the move, Ryan Lane wrote an OpenStack tool that allowed users to simultaneously authenticate in both data centers, and updated the Labs web interface so that both data centers were visible at the same time.
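
As an illustration of the multi-region idea (a sketch using the modern openstacksdk, not the Folsom/Havana-era tool Ryan wrote; the endpoint, credentials and region names are placeholders), listing a project's instances in both regions side by side might look like:

```python
import openstack

def servers_by_region(auth_url, project, username, password, regions):
    """Connect to the same Keystone endpoint once per region and list servers."""
    result = {}
    for region in regions:
        conn = openstack.connect(
            auth_url=auth_url,
            project_name=project,
            username=username,
            password=password,
            region_name=region,
            user_domain_name="Default",
            project_domain_name="Default",
        )
        result[region] = [server.name for server in conn.compute.servers()]
    return result

# servers_by_region("https://keystone.example.org:5000/v3", "tools",
#                   "me", "secret", ["tampa-region", "ashburn-region"])
```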

At this point (roughly a month ago), we had two different clouds running: one full and one empty. Because of a shared LDAP back-end, the new cloud already knew about all of our projects and users.

Two clouds, before migration

Then we called on volunteers and project admins for help. In some cases, volunteers built fresh new Labs instances in Ashburn. In other cases, instances were shut down in Tampa and duplicated using a simple copy script run by the Wikimedia Operations team. In either case, project functions were supported in both data centers at once so that services could be switched over quickly and at the convenience of project admins.

Two clouds, during migration

As of today, over 50 projects have been copied to or rebuilt in Ashburn. For those projects with uptime requirements, the outages were generally limited to a few minutes.

Switch to OpenStack Neutron

We currently rely on the ‘nova-network’ service to manage network access between Labs instances. Nova-network is working fine, but OpenStack has introduced a new network service, Neutron, which is intended to replace nova-network. We hoped to adopt Neutron in the Ashburn cloud (largely in order to avoid being stuck using unsupported software), but quickly ran into difficulties. Our current use case (flat DHCP with floating IP addresses) is not currently supported in Neutron, and OpenStack designers seem to be wavering in their decision to deprecate nova-network.

After several days of experimentation, expedience won out and we opted to reproduce the same network setup in Ashburn that we were using in Tampa. We may or may not attempt an in-place switch to Neutron in the future, depending on whether or not nova-network continues to receive upstream support.

Switch to NFS storage

Most Labs projects have a shared project-wide volume for storing files and transferring data between instances. In the original Labs setup, these shared volumes used GlusterFS. GlusterFS is easy to administer and designed for use cases similar to ours, but we’ve been plagued with reliability issues: in recent months, the lion’s share of Labs failures and downtime were the result of Gluster problems.

When setting up Tool Labs last year and facing our many issues with GlusterFS, Marc-Andre Pelletier opted to set up a new NFS system to manage shared volumes for the Tool Labs project. This work has paid off with much-improved stability, so we’ve adopted a similar system for all projects in Ashburn.

Again, we largely relied on volunteers and project admins to transfer files between the two systems. Most users were able to copy their data over as needed, scping or rsyncing between Tampa and Ashburn instances. As a hedge against accidental data loss, the old Gluster volumes were also copied over into backup directories in Ashburn using a simple script. The total volume of data copied was around 30 Terabytes; given the many-week migration period, network bandwidth between locations turned out not to be a problem.
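
A minimal sketch of that kind of backup copy (an assumption about its shape, not the actual Operations script; hostnames and paths are placeholders), run from the destination storage host:

```python
import subprocess

def backup_project_volume(project, src_host, dst_root="/srv/labs-backup"):
    """Pull a project's shared volume from the old Gluster host over SSH
    into a backup directory on the new NFS server."""
    src = f"{src_host}:/data/project/{project}/"
    dst = f"{dst_root}/{project}/"
    cmd = ["rsync", "-a", "--partial", "-e", "ssh", src, dst]
    subprocess.run(cmd, check=True)

# backup_project_volume("toolsbeta", "gluster.tampa.example.org")
```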

Identify and reclaim wasted space

Many Labs projects and instances are set up for temporary experiments, and have a short useful life. The majority of them are cleaned up and deleted after use, but Labs still has a tendency to leak resources as the odd instance is left running without purpose.

We’ve never had a very good system for tracking which projects are or aren’t in current use, so the migration was a good opportunity to clean house. For every project that was actively migrated by staff or volunteers, another project or two simply sat in Tampa, unmentioned and untouched. Some of these projects may yet be useful (or might have users but no administrators), so we need to be very careful about prematurely deleting them.

Projects that were not actively migrated (or noticed, or mentioned) during the migration period have been ‘mothballed’. That means that their storage and VMs were copied to Ashburn, but are left in a shutdown state. These instances will be preserved for several months, pending requests for their revival. Once it’s clear that they’re fully abandoned (in perhaps six months), they will be deleted and the space reused for future projects.

Conclusions

In large part, this migration involved a return to older, more tested technology. I’m still hopeful that in the future Labs will be able to make use of more fundamentally cloud-designed technologies like distributed file shares, Neutron, and (in a perfect world) live instance migration. In the meantime, though, the simple approach of setting up parallel clouds and copying things across has gone quite well.

This migration relied quite heavily on volunteer assistance, and I’ve been quite charmed by how gracious the vast majority of volunteers were about this inconvenience. In many cases, project admins regarded the migration as a positive opportunity to build newer, cleaner projects in Ashburn, and many have expressed high hopes for stability in the new data center. With a bit of luck we’ll prove this optimism justified.

Andrew Bogott, DevOps Engineer

Request for proposals: New datacenter in the continental US

The Wikimedia Foundation’s Technical Operations team is seeking proposals on the provisioning of a new datacenter facility.

After working through the specifics internally, we now have a public RFP posted and ready for proposals. We invite any organization meeting the requirements outlined to submit a proposal for review. Most of the relevant details are in the document itself, but feel free to reach out to me or anyone on the Technical Operations team should you have any questions.

Please, feel free to forward this link far and wide; have colleagues, contacts or friends in the datacenter sector? Then please, forward it on! :)

Below are the primary requirements, excerpted from the RFP:

Primary Requirements

  • The data center location must be in the midwestern/western continental US (i.e., Chicago westward).
  • The capacity for at least 32 enclosures initially; expansion possibilities (right of first refusal in contract on adjacent or nearby cage area) for another row of 8.

HTTPS by default beta program


Now that we’ve enabled HTTPS by default for logged-in users, our next major objective is to enable HTTPS by default for anonymous users. We have a number of steps to take to arrive at this goal, including a couple of important initial steps, such as conducting proper load testing and shaking out bugs on smaller-scale deployments before mass deployment.

For both load testing and shaking out bugs, we’d like to switch some Wikimedia projects to HTTPS by default. We’re launching a beta program where you can opt your project in for HTTPS by default testing. Signing up for this program doesn’t necessarily mean your project will be selected in the first rounds of testing, but it will put your project into a pool of wikis we can select from.

To sign up for the program, please get consensus from the contributor community on your project and add it to the list on meta.

Ryan Lane
Operations Engineer, Wikimedia Foundation


HTTPS enabled by default for logged-in users on Wikimedia sites


Today, August 28, the Wikimedia Foundation is making a change to the software that powers the Wikimedia projects: By default, all logged-in users will now be using HTTPS to access Wikimedia sites. What this does is encrypt the connection between the Wikimedia servers and the user’s browser so that the information sent between the two is not readable by anyone else. This is in response to the recent concerns over the privacy and security of our user community, and we explained the rationale for this change in our post about the future of HTTPS at Wikimedia.

What this means for you

How this works is simple: If a user wants to log in, they will be redirected to use HTTPS for the login, thus keeping their username and password secure. After they are logged in, they stay on the HTTPS version of the Wikimedia site they are using.

Excluded Countries

Some users live in areas where HTTPS is not an easy option, most often because of explicit blocking by a government. At the request of these communities, we have made an explicit exclusion for users from those affected countries. Simply put, users from China and Iran will not be required to use HTTPS for logging in, nor for viewing any Wikimedia project site.

Disabling

Are you having a slow or unreliable experience while browsing Wikimedia sites over HTTPS? Then you can turn HTTPS off in your user preferences, under the “User profile” tab: Uncheck “Always use a secure connection when logged in”. You will need to log out and log in again for the preference to take effect. But remember, you will still need to log in using the secure HTTPS process.

HELP!

For further details, please see the HTTPS page on Meta-Wiki, which is available in several languages.

Are you unable to log in and edit a Wikimedia wiki after this change? Please contact the Wikimedia Foundation Operations team via any means you find comfortable, including this blog post’s comments section, on IRC in the #wikimedia-operations channel, or via the https@wikimedia.org email address.

Greg Grossmeier
Release Manager, Wikimedia Foundation


The future of HTTPS on Wikimedia projects


The Wikimedia Foundation believes strongly in protecting the privacy of its readers and editors. Recent leaks of the NSA’s XKeyscore program have prompted our community members to push for the use of HTTPS by default for the Wikimedia projects. Thankfully, this is already a project that was being considered for this year’s official roadmap and it has been on our unofficial roadmap since native HTTPS was enabled.

Our current architecture cannot handle HTTPS by default, but we’ve been incrementally making changes to make it possible. Since we appear to be specifically targeted by XKeyscore, we’ll be speeding up these efforts. Here’s our current internal roadmap:

  1. Redirect to HTTPS for log-in, and keep logged-in users on HTTPS. This change is scheduled to be deployed on August 21, at 16:00 UTC. Update as of 21 August: we have delayed this change and will now deploy it on Wednesday, August 28 at 20:00 UTC/1pm PT.
  2. Expand the HTTPS infrastructure: Move the SSL terminators directly onto the frontend varnish caches, and expand the frontend caching clusters as necessitated by increased load.
  3. Put in engineering effort to more properly distribute our SSL load across the frontend caches. In our current architecture, we’re using a source hashing based load balancer to allow for SSL session resumption. We’ll switch to an SSL terminator that supports a distributed SSL cache, or we’ll add one to our current solution. Doing so will allow us to switch to a weighted round-robin load balancer and will result in a more efficient SSL cache.
  4. Starting with smaller projects, slowly soft-enable HTTPS for anonymous users by default, gradually moving toward soft-enabling it on the larger projects as well. By soft-enable we mean changing our rel=canonical links in the head section of our pages to point to the HTTPS version of pages, rather than the HTTP versions. This will cause search engines to return HTTPS results, rather than HTTP results (see the sketch after this list).
  5. Consider enabling perfect forward secrecy. Enabling perfect forward secrecy is only useful if we also eliminate the threat of traffic analysis of HTTPS, which can be used to detect a user’s browsing activity, even when using HTTPS.
  6. Consider doing a hard-enable of HTTPS. By hard-enable we mean force redirecting users from HTTP pages to the HTTPS versions of those pages. A number of countries, China being the largest example, completely block HTTPS to Wikimedia projects, so doing a hard-enable of HTTPS would probably block large numbers of users from accessing our projects at all. Because of this, we feel this action would probably do more harm than good, but we’ll continue to evaluate our options here.
  7. Consider enabling HTTP Strict Transport Security (HSTS) to protect against SSL-stripping man-in-the-middle attacks. Implementing HSTS could also lead to our projects being inaccessible for large numbers of users as it forces a browser to use HTTPS. If a country blocks HTTPS, then every user in the country that received an HSTS header would effectively be blocked from the projects.
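
To make step 4 concrete, here is a toy illustration (not MediaWiki's actual implementation; the helper is hypothetical) of what soft-enabling amounts to: emitting a rel=canonical link that points at the HTTPS URL so search engines index and return HTTPS results.

```python
def canonical_link(page_url, soft_https_enabled=True):
    """Emit the rel=canonical tag for a page, pointing at the HTTPS URL
    when soft-enabled."""
    url = page_url.replace("http://", "https://", 1) if soft_https_enabled else page_url
    return '<link rel="canonical" href="%s"/>' % url

print(canonical_link("http://en.wikipedia.org/wiki/HTTPS"))
# A hard-enable (step 6) would instead 301-redirect HTTP requests to HTTPS,
# and HSTS (step 7) adds a "Strict-Transport-Security: max-age=..." header.
```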

Currently we don’t have time frames associated with any change other than redirecting logged-in users to HTTPS, but we will be making time frames internally and will update this post at that point.

Until HTTPS is enabled by default, we urge privacy-conscious users to use HTTPS Everywhere or Tor [1].

Ryan Lane
Operations Engineer, Wikimedia Foundation

[1]: There are restrictions with Tor; see Wikipedia’s information on this.


Wikipedia Adopts MariaDB

This past Wednesday marked a milestone in the evolution of Wikimedia’s Database infrastructure: the completion of the migration of the English and German Wikipedias, as well as Wikidata, to MariaDB 5.5.

For the last several years, we’ve been operating the Facebook fork of MySQL 5.1 with most of our production environment running a build of r3753. We’ve been pleased with its performance; Facebook’s MySQL team contains some of the finest database engineers in the industry and they’ve done much to advance the open source MySQL ecosystem.

That said, MariaDB’s optimizer enhancements and the feature sets of Percona’s XtraDB (many of which overlap with the Facebook patch, but I particularly like add-ons such as the ability to save the buffer pool LRU list, avoiding costly warmups on new servers) and of Oracle’s MySQL 5.5 provide compelling reasons to consider upgrading. Equally important, as supporters of the free culture movement, the Wikimedia Foundation strongly prefers free software projects; that includes a preference for projects without bifurcated code bases between differently licensed free and enterprise editions. We welcome and support the MariaDB Foundation as a not-for-profit steward of the free and open MySQL-related database community.

Preparing For Change

Major version upgrades of a production database are not to be made lightly. In fact, as late as 2011, some Wikipedia languages were still running a heavily patched version of MySQL 4.0; the migration to 5.1 required both schema changes and direct modifications of data dumps to alter the padding of binary-typed columns. MySQL 5.5 contains a variety of incompatibilities with prior versions, thanks in part to better compliance with SQL standards. Changes to the query optimizer between versions may also change the execution plan for common queries, sometimes for the better but, historically, sometimes not. SQL behavior changes may result in replication breakage or data consistency issues, while performance regressions, whether from query plan or other changes, can cause site outages. This calls for a lot of testing.

Compatibility testing was accomplished by running MariaDB replicas outside of production, watching for replication errors, replaying production read queries and validating results. After identifying and fixing a couple of MediaWiki issues that surfaced as replication errors (along the lines of trying to set unsigned integer types to negative values, which previously caused a wrap-around instead of an error), we replayed production read queries using pt-upgrade from Percona Toolkit. Pt-upgrade replays a query log against two servers and compares the responses for variances or errors. Scripts originally developed for our recent datacenter migration to simultaneously warm up many standby databases from current production read traffic helped with rough load testing and benchmarking. Along the way, a pair of bugs in MariaDB 5.5.28 and 5.5.29 were identified, one of which was a rare but potentially severe performance regression related to a new query optimizer feature. The MariaDB team was very responsive and quick to offer solutions, complete with test cases.
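
For illustration only, the core idea pt-upgrade implements, replaying the same read queries against two servers and flagging differing results, can be sketched in a few lines. This assumes the PyMySQL driver and read-only SELECTs, and is nothing like as thorough as the real tool:

```python
import pymysql

def compare_read_queries(queries, old_cfg, new_cfg):
    """Replay each query on both servers and collect those whose result
    sets differ. Rows may legitimately differ in order unless the query
    has an ORDER BY, so treat mismatches as leads, not verdicts."""
    old = pymysql.connect(**old_cfg)
    new = pymysql.connect(**new_cfg)
    mismatches = []
    for query in queries:
        with old.cursor() as c_old, new.cursor() as c_new:
            c_old.execute(query)
            c_new.execute(query)
            if c_old.fetchall() != c_new.fetchall():
                mismatches.append(query)
    old.close()
    new.close()
    return mismatches
```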

Performance Testing In Production

As a read-heavy site, Wikipedia aggressively uses edge caching. Approximately 90% of pageviews are served entirely from the edge, while at the application layer we utilize both memcached and redis in addition to MySQL. Despite that, the MySQL databases serving English Wikipedia alone reach a daily peak of ~50k queries/second. Most are read queries served by load-balanced slaves, depending on consistency requirements. 80% of the English Wikipedia query load (up to 40k qps) is typically handled by just two database servers at any given time. Our most common query type (40% of all) has a median execution time of ~0.2ms and a 95th percentile time of ~50ms. To successfully use MariaDB in production, we need it to keep up with the level of performance obtained from Facebook’s MySQL fork, and to behave consistently as traffic patterns change.

Ishmael views of pt-query-digest data collected via tcpdump for the most common Wikipedia read queries (pdf). The first page of a query shows data from db1042, running 5.1fb-r3753, the second from db1043 over the same time period, running MariaDB 5.5.30.

Once confident that application compatibility issues were solved and comfortable with performance obtained under benchmark conditions, it was time to test in production. One of the production read slaves from the English Wikipedia shard was taken out of rotation, upgraded to MariaDB 5.5.30, and then returned for warmup. The load balancer weight was then gradually increased until it and a server still running MySQL 5.1-facebook-r3753 were equally weighted and receiving most of the query load.

Also from the Percona Toolkit, we use pt-query-digest across all database servers to collect query performance data, which is then stored in a centralized database. Query data is collected from two sources per server and stored in separate buckets: from the slow query log, which only captures queries exceeding 450 ms, and from periodic brief sampling of all queries obtained via tcpdump. Ishmael provides a convenient way to visualize and inspect query digest data over time. Using it, along with direct analysis of the raw data, allowed us to validate that every query continued to perform within acceptable bounds.
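
The fingerprinting step that makes this per-query-type bucketing possible can be sketched roughly as follows (a simplification for illustration, not Percona's implementation):

```python
import re
from collections import defaultdict

def fingerprint(sql):
    """Collapse literals and whitespace so structurally identical queries
    share one bucket, roughly what pt-query-digest calls a fingerprint."""
    sql = sql.strip().lower()
    sql = re.sub(r"\s+", " ", sql)       # normalise whitespace
    sql = re.sub(r"'[^']*'", "?", sql)   # string literals -> ?
    sql = re.sub(r"\b\d+\b", "?", sql)   # numeric literals -> ?
    return sql

timings = defaultdict(list)  # fingerprint -> list of execution times (ms)
timings[fingerprint("SELECT * FROM page WHERE page_id = 42")].append(0.19)
timings[fingerprint("SELECT * FROM page WHERE page_id = 777")].append(0.21)
# Both samples land in the same bucket:
print(timings["select * from page where page_id = ?"])  # [0.19, 0.21]
```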

For our most common query type, 95th percentile times over an 8-hour period dropped from 56ms to 43ms and the average from 15.4ms to 12.7ms. 50th percentile times remained a bit better with the 5.1-facebook build over the sample period, 0.185ms vs. 0.194ms. Many query types were 4-15% faster with MariaDB 5.5.30 under production load, a few were 5% slower, and nothing appeared aberrant beyond those bounds.

From there, we upgraded the remaining slaves one by one, before finally rotating in a newer upgraded class of servers to act as masters. The switch was seamless and performance continues to look good. We’ll be completing the migration of shards covering the rest of our projects over the next month. Beyond that, we’re looking forward to the future release of MariaDB 10 (global transaction IDs!), and are continually assessing ways to improve our data storage infrastructure. If you’re interested in helping, the Wikimedia Foundation is hiring!

Asher Feldman, Site Architect