Breaking up the MediaWiki codebase into separate PHP libraries can benefit many developers across the open software movement. Wikimedia Zürich 2014 Hackathon photo by Christian Meixner, freely licensed under CC-BY-SA 3.0.

MediaWiki is the free, open-source wiki software used to power Wikipedia and thousands of other wikis. It is written in the PHP programming language, because PHP is widely available and offers a low barrier to code contributions. Over the last 12 years, MediaWiki has grown from a small collection of PHP scripts to almost 1,800 PHP scripts in the core application alone. The contributions of hundreds of individual developers have helped make it a feature-rich, secure and scalable platform capable of powering some of the largest collaboratively edited reference projects in the world.

Unfortunately, MediaWiki has grown to a level of complexity that makes it difficult to understand the code base and reason about the consequences of change. As a result, contributing to MediaWiki is harder than it could be. A significant portion of the PHP code used in MediaWiki is only incidentally related to the notion of a wiki.

To address this problem, the Wikimedia Foundation has begun breaking MediaWiki into smaller, reusable libraries that can be developed more easily and integrated into any PHP application. We hope that other applications can benefit from MediaWiki’s robust platform, which supports sites that scale up to 20 billion page views a month.

Breaking up the codebase

The first two libraries to come out of MediaWiki’s legacy codebase were CSSJanus and CDB. CSSJanus is a PHP port of the original Python library that is used by MediaWiki to convert CSS stylesheets between left-to-right and right-to-left orientations. The PHP port has been used in MediaWiki’s ResourceLoader since 2010 and has been tested extensively for correctness and performance. Both the PHP port and a JavaScript version are now being managed as an independent FOSS project that can benefit not only MediaWiki but also a much wider audience of software developers.
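To make the idea of flipping a stylesheet concrete, here is a minimal sketch in PHP. It is not the CSSJanus API: `flipLeftRight` is a hypothetical helper that only swaps the whole-word keywords "left" and "right". The real library handles many more cases, such as four-value shorthand properties and content that must not be flipped.

```php
<?php
// Hypothetical helper, NOT part of CSSJanus: swaps the whole-word keywords
// "left" and "right" in a stylesheet. Words such as "copyright" are left
// alone because \b requires a word boundary on both sides of the match.
function flipLeftRight(string $css): string {
    $flipped = preg_replace_callback(
        '/\b(left|right)\b/',
        static function (array $match): string {
            return $match[1] === 'left' ? 'right' : 'left';
        },
        $css
    );
    // preg_replace_callback() returns null on a regex error; fall back to the input.
    return $flipped ?? $css;
}

$flipped = flipLeftRight('.a { float: left; margin-right: 4px; }');
```

A single pass with a callback avoids the classic bug of replacing "left" with "right" and then turning every "right" back into "left".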

CDB, short for “constant database”, is a fast and reliable file-based key-value store. MediaWiki uses CDB files to cache localized interface messages. Wikimedia’s MediaWiki cluster serves content and chrome in 287 languages, so access to localized strings has to be fast and reliable. We have found CDB to be an excellent solution, and are excited to finally share our PHP CDB library in reusable form. The library provides a wrapper for PHP’s native dba_* functions as well as a pure-PHP implementation for environments like HHVM that do not have access to the native functions.
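As a rough illustration of what the native functions offer, the sketch below writes a few messages to a CDB file and reads one back using PHP's dba extension. The helper names `hasNativeCdb` and `cdbRoundTrip`, the file path, and the sample message are assumptions made for this example and are not the interface of our CDB library. When the dba extension or its CDB handlers are missing, the sketch simply returns null, which is the point where a pure-PHP implementation would take over.

```php
<?php
// Hypothetical helper: true when the dba extension is loaded with both the
// "cdb" (read) and "cdb_make" (write) handlers.
function hasNativeCdb(): bool {
    if (!function_exists('dba_handlers')) {
        return false;
    }
    $handlers = dba_handlers();
    return in_array('cdb', $handlers, true) && in_array('cdb_make', $handlers, true);
}

// Hypothetical helper: writes key/value pairs to a CDB file, then reads one
// key back. Returns null when the native functions cannot be used.
function cdbRoundTrip(string $path, array $messages, string $key): ?string {
    if (!hasNativeCdb()) {
        return null; // a pure-PHP reader/writer would be used here instead
    }
    $writer = dba_open($path, 'n', 'cdb_make'); // 'n' creates or truncates the file
    if ($writer === false) {
        return null;
    }
    foreach ($messages as $name => $text) {
        dba_insert((string) $name, (string) $text, $writer);
    }
    dba_close($writer); // closing finalizes the constant database

    $reader = dba_open($path, 'r', 'cdb');
    if ($reader === false) {
        return null;
    }
    $value = dba_fetch($key, $reader);
    dba_close($reader);
    return $value === false ? null : $value;
}

$value = cdbRoundTrip(
    sys_get_temp_dir() . '/example-messages.cdb',
    ['mainpage' => 'Main Page'],
    'mainpage'
);
```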

Using Composer to import libraries

Having extracted these libraries, we needed to reintroduce their use in the core application and maintain the ability to track the version of the library in use with each MediaWiki version deployed. Managing external dependencies without adding too much complexity to the deployment process has historically been a roadblock to decomposing the larger application. Luckily, Composer has largely solved this problem in the modern PHP stack.

Composer logo by WizardCat, released under the MIT license.

Composer is a project-local package manager for PHP in the vein of npm, Bundler and virtualenv. By including a composer.json configuration file, we are now able to specify our dependencies programmatically and let Composer do the hard work of downloading the libraries and making their classes available to a MediaWiki install. Adding Composer support to the core product has also given us an easy avenue for including additional third-party dependencies in the future. This gives the MediaWiki developer community an opportunity to examine several of the components we have built ourselves, to determine if they could be replaced by external projects that are better supported or that offer more compelling features.
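For readers who have not used Composer, a composer.json that declares the two extracted libraries might look roughly like the fragment below. The package names and the pinned version numbers are assumptions for illustration only; check the package registry for the actual names and current releases.

```json
{
    "require": {
        "cssjanus/cssjanus": "1.1.1",
        "wikimedia/cdb": "1.0.1"
    }
}
```

Pinning exact versions, rather than using open-ended ranges, is one way to keep track of which library version ships with each deployed MediaWiki version.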

Removing barriers to library extraction

A major roadblock to separating and publishing more libraries was that logging and profiling calls specific to MediaWiki were included everywhere in the code. We looked at the current practices of the larger PHP community and found that our homegrown code could be largely replaced with the PSR-3 logging standard and the XHProf profiling library.

The choice of PSR-3 as a logging interface was easy. The standards produced by the PHP Framework Interop Group are being adopted by a growing number of PHP applications and libraries, so following them makes it easier to decouple code from MediaWiki while still retaining valuable runtime debugging capabilities. As an added bonus, it will make integrating externally developed software with MediaWiki easier and open the possibility of replacing our home-grown logging solution with another PSR-3 compliant logging implementation. The Wikimedia Foundation is currently experimenting with using the Monolog library instead of the PSR-3 port of our legacy logging system.
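The decoupling described here can be sketched in a few lines of PHP. To keep the example runnable without Composer, `MinimalLogger` is a local stand-in for the real `Psr\Log\LoggerInterface`, which additionally declares per-level methods such as `debug()` and `error()`. `ArrayLogger` and `MessageCache` are hypothetical classes invented for this illustration and are not MediaWiki code.

```php
<?php
// Local stand-in for Psr\Log\LoggerInterface (an assumption made so this
// sketch runs on its own); real code would depend on the psr/log package.
interface MinimalLogger {
    public function log($level, $message, array $context = []);
}

// Hypothetical in-memory logger. It fills {placeholder} tokens in the
// message from the context array, following the PSR-3 convention.
class ArrayLogger implements MinimalLogger {
    public $records = [];

    public function log($level, $message, array $context = []) {
        $replace = [];
        foreach ($context as $key => $value) {
            $replace['{' . $key . '}'] = (string) $value;
        }
        $this->records[] = $level . ': ' . strtr($message, $replace);
    }
}

// Hypothetical library class: it depends only on the logging interface,
// so it can be extracted and reused outside the host application.
class MessageCache {
    private $logger;

    public function __construct(MinimalLogger $logger) {
        $this->logger = $logger;
    }

    public function get($key) {
        // This sketch never caches anything; it only shows the logging call.
        $this->logger->log('debug', 'Cache miss for {key}', ['key' => $key]);
        return null;
    }
}

$logger = new ArrayLogger();
$cache = new MessageCache($logger);
$cache->get('mainpage');
```

Because the class receives its logger through the constructor, swapping in a different PSR-3 compliant implementation would not require touching the library code.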

XHProf is a FOSS project originally developed by Facebook which is available for both PHP5 and HHVM. We chose XHProf as the backbone for our updated profiling pipeline because the measurements it collects happen at the interpreter level rather than relying on explicit invocations of start/stop methods. This systemic approach allows code to be profiled without tight coupling to the profiling system. As an added benefit, it has less impact on the system being profiled, which should produce slightly better measurements.
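The difference from explicit start/stop calls can be shown with a small sketch. The wrapper `profileCall` and the workload `sumSquares` are hypothetical names for this example; `xhprof_enable()` and `xhprof_disable()` are the functions provided by the XHProf extension. The sketch assumes the extension may be absent and falls back to running the code unprofiled.

```php
<?php
// Hypothetical wrapper: profiles a callable with XHProf when the extension
// is loaded, otherwise just runs it. Returns [result, profile data or null].
function profileCall(callable $work): array {
    $enabled = function_exists('xhprof_enable');
    if ($enabled) {
        xhprof_enable();
    }
    $result = $work();
    $profile = $enabled ? xhprof_disable() : null;
    return [$result, $profile];
}

// The code being measured contains no profiling calls of its own.
function sumSquares(int $n): int {
    $total = 0;
    for ($i = 1; $i <= $n; $i++) {
        $total += $i * $i;
    }
    return $total;
}

[$result, $profile] = profileCall(static function () {
    return sumSquares(1000);
});
```

Only the outermost entry point knows about the profiler, so a function like `sumSquares` could move into a standalone library unchanged.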

Next steps

Completing the logging and profiling changes puts MediaWiki in a good place for continued modernization. The developer community is in the process of turning the lessons learned from the initial library extractions into guidelines for how to manage new FOSS projects. We have a Composer-based infrastructure to manage reintroducing the code we extract into the project in a clean and well-versioned manner. Most importantly, we have now built momentum and are excited to keep moving forward.

Our next steps are to get current (and future) MediaWiki developers excited about using these tools to continue the process. We have started a list of potential projects that could be extracted with varying degrees of difficulty. Some of these might be good candidates for Google Summer of Code (GSoC) projects; others will take a fairly deep understanding of the current MediaWiki code base. We have also opened up new avenues for MediaWiki development. Historically, sharing code between two extensions required either that one depend on the other or that a third extension be created just to hold the common code. This creates issues with the runtime load order of the extensions and may require enabling unwanted features on the wiki. Today we can instead promote the use of semantically versioned libraries. We are already seeing development along these lines, with OOjs UI being included in MediaWiki as a library and work under way to replace the Mantle extension with a third-party template library.

Our long-term vision is for MediaWiki to become an application composed of many small, purpose-built libraries with interfaces that allow individual libraries to be exchanged for others. This will make MediaWiki as a whole easier to understand by reducing the entanglement between components. It will also make it much easier to introduce code from third parties. This “Librarization” project is a small but important way to remove technical and social hurdles for a long-term shift in development practices by the MediaWiki community.

Bryan Davis, Software Engineer, Wikimedia Foundation
Chad Horohoe, Software Engineer, Wikimedia Foundation
Kunal Mehta, Software Engineer, Wikimedia Foundation