Wikimedia blog

News from the Wikimedia Foundation and about the Wikimedia movement

Posts Tagged ‘open-source’

What Lua scripting means for Wikimedia and open source

Yesterday we flipped a switch: editors can now use Lua, an innovative programming language, to generate sections of wiki pages on all our sites. We’d like to talk about what this means for the open source community at large, for Wikimedians, and for our future.

Why we did this

In the old wikitext templating system, this is part of Template:Citation/core. Any Wikipedia article citing a source will cause our CPUs to run through this instructionset. With Lua, we’ll be able to replace this.

When we started digging into the causes of slow pageload times a few years ago, we saw that our CPUs ate a lot of time interpreting templates — useful bits of markup that programmatically told MediaWiki to reuse little bits of text. Templates are everywhere on our sites. Good Wikipedia articles heavily use the citation templates, for instance, and you’ve seen the ubiquitous infoboxes on every biography. In fact, editors can write code to generate substantial portions of wiki pages. Hit “View source” sometime to see how.

But, because we’d never planned for wikitext to become a programming language, these templates were terribly inefficient and hacky — they didn’t even have recursion or loops — and were terrible for performance. When you edit a complex article like Tulsi Gabbard, with scores of citations, it can take up to 30 seconds to parse and display the page. Even as we worked to improve performance via caching, query profiling, new hardware, and other common means, we sometimes had to advise our community to remove functionality from a particular template so pages would render faster.

This wouldn’t do. It was a terrible experience for our users and especially hard for our editors, who had to wait for a multi-second roundtrip after every “how would this page look?” preview.

So our staffers and volunteers worked on Scribunto (from the Latin for “they shall write”), a MediaWiki extension to allow editors to embed Lua scripts instead of wikitext for templating. And volunteers and Foundation staffers have already started identifying pages that are slow to render and converting the most inefficient templates. We have 488,731 templates on English Wikipedia alone right now. The process of turning many of those into Lua scripts is going to affect everyone who reads our sites — and the Scribunto project has already started giving back to the Lua community.

Us and Lua

For instance, our engineer Brad Jorsch wrote mw.ustring.lua, a Unicode module reusable by other Lua developers. This library is good news for people who write templates in non-Latin characters, and for anyone who wants a version of Lua’s standard String library where the methods operate on characters in UTF-8 encoded strings rather than bytes.

And with Scribunto, we empower those frustrated Wikimedians who have been spending years breaking their knuckles making amazing things in wikitext; as they learn how much easier it is to script in Lua, we hope they’ll be able to use those skills in their hobbies, schools, and workplaces. They’ll join forces with the graduates of Codecademy, World of Warcraft, and the other communities that teach anyone to program. New programmers with basic knowledge of computer science who want to do something real with their new skills will find that Lua scripting on Wikimedia sites is a logical next step for them. Our implementation only differs slightly from standard Lua.

And since Scribunto is an extension that any MediaWiki administrator can install, we hope the MediaWiki administrators out there will enjoy using Lua to more easily customize their wikis for their users.

Structured data and new ways to display it

Scribunto lays the foundations for exciting work to come when the Wikidata structured data project comes further online (the Wikidata interface is still in development and being deployed in phases). We know that Lua will be an attractive way to integrate Wikidata information into pages, and we hope a lot of (currently) unstructured data will get structured, helping new applications emerge.

Now that Lua and Wikidata are more mature, we can look forward to enabling more functionality and plugging in more libraries. And as we continue deploying Wikidata, people will make interesting improvements that we currently can’t predict. For instance, right now, each citation is hard to programmatically dissect; the Cite template takes many unstructured parameters (“author1,” “author2,” etc.) We structure these arguments by convention, but the data’s not structured as CS folks would have it, and can’t be queried via APIs, remixed, and so on.

Excerpt of Coordinates module

A screenshot of part of the new Coordinates module, written in Lua by User:Dragons flight. Note that, with Lua, we can actually use proper conditionals.

But in the future, we could have citations stored in Wikidata and then put together onto article pages using Lua, or even assembled into other various reasonable forms (automatically generated bibliographies?) using Lua, and it will be more easy for Zotero users to discover. That’s just one example; on all our sites over the next few years, things will change from the status quo in a user-visible way. The old math and geography templates were inefficient and hard to hack; once rewritten, they’ll run faster and perhaps editors will use them more. We might see galleries, automatic data analyses, better annotated maps, and various other interesting processes and queries embedded in Wikimedia pages.

Open for change

Wikimedians have been writing wikitext templates for years, and doing hard, astounding, unexpected things with them for readers to enjoy. But the steep learning curve drove contributors away. With Lua, a genuine programming language, people now have a deeper and more useful foundation to build upon. And for years, power users on our sites have customized their experiences with JavaScript/CSS Gadgets and user scripts, but those are basically one level above skins preferences; other people won’t stumble upon your hacks in the process of reading an article.

So, now is the first time that the Wikimedia site maintainers have enabled real coding that affects all readers. We’re letting people program Wikipedia unsupervised. Anyone can write a chunk of code to be included in an article that will be seen by millions of people, often without much review. We are taking our “anyone can edit” maxim one big step forward.

If someone doesn’t like the load time of a webpage, they can now actually improve it themselves. Just as we crowdsourced building Wikipedia, now we’re crowdsourcing bits of infrastructure improvement. And this kind of massively multiplayer, crowdsourced performance improvement is uniquely us.

Wikitext templates could do a lot of things, but Lua does them better and faster, and now mere mortals can do it. We’re aiming to help our users learn to program, to empower themselves, and to help each other and help our readers.

We hope you’ll join us.

Sumana Harihareswara, Engineering Community Manager

Google Summer of Code students reach project milestones

This year, the MediaWiki community again participated in Google Summer of Code, in which we selected nine students to work on new features or specific improvements to the software. They were sponsored by Google and mentored by experienced developers, who helped them become part of the development community and guided their code development.

Congratulations to the eight students who have made it through the summer of 2012 (our seventh year participating in GSoC)! They all accomplished a great deal, and many of them are working to improve their projects to benefit the Wikimedia community even more.

Google Summer of Code 2012

Eight students passed MediaWiki’s GSoC program in 2012.

  • Ankur Anand worked on integrating Flickr upload and geolocation into UploadWizard. WMF engineer Ryan Kaldari mentored Ankur as they made it easier for Wikimedia contributors to contribute media files and metadata. Read his wrapup and anticipate the merge of his code into the main UploadWizard codebase.
  • Harry Burt worked on TranslateSvg (“Bringing the translation revolution to Wikimedia Commons”). When his work is complete and deployed, we will more easily able to use a single picture or animation in different language wikis. See this image of the anatomy of a human kidney, for example; it has a description in eight languages, so it benefits multiple language Wikipedias (e.g., Spanish and Russian). Harry aims to allow contributors to localize the text embedded within vector files (SVGs), and you can watch a demo video, try out the test site, or just read Harry’s wrapup post. WMF engineer Max Semenik mentored this project.
  • Akshay Chugh worked on a convention/conference extension for MediaWiki. Wikimedia conferences like Wikimania often use MediaWiki to help organize their conferences, but it takes a lot of custom programming. Under the mentorship of volunteer developer Jure Kajzer, Akshay created the beta of an extension that a webmaster could install to provide conference-related features automatically. See his wrapup post.
  • Flow diagram for Ashish Dubey's project

    Ashish is working on the architecture that will support real-time collaboration.

    Ashish Dubey worked on realtime collaboration in the upcoming Visual Editor (you may have seen “real-time collaborative editing” in tools like Etherpad and Google Docs). Ashish (with WMF engineer Trevor Parscal as mentor) has implemented a collaboration server and other features (see his wrapup post) and has achieved real-time “spectation,” in which readers can see an editor’s changes in realtime. Wikimedia Foundation engineers plan to integrate Ashish’s work into VisualEditor around April to June 2013.

  • Nischay Nahata optimized the performance of the Semantic MediaWiki extension. In wikis with unusually large amounts of content, Semantic MediaWiki experiences performance degradation. With the mentorship of head Semantic MediaWiki developer Markus Krötzsch (a volunteer) and Wikidata developer Jeroen De Dauw, Nischay found and fixed many of these issues.  This also reduces SMW’s energy consumption, making it greener. Nischay’s work will be in Semantic MediaWiki 1.8.0, which is currently in beta and due to be released soon. Wikimedia Labs uses Semantic MediaWiki and will benefit from the performance improvements.
  • Proposal to redesign the MediaWiki watchlist

    Arun Ganesh illustrated the watchlist redesign proposal with this mockup.

    Aaron Pramana worked on watchlist grouping and workflow improvements. Aaron wants to make it easier for wiki editors and readers to use watchlists, and to create and use groups of watched items to focus on or share. Aaron worked with volunteer developer Alex Emsenhuber. The back end of the system is done, but Aaron wants your input about the user interface. Folks on the English Wikipedia’s Village Pump have started discussing it.

  • Robin Pepermans worked on Incubator improvements and language support, mentored by WMF engineer Niklas Laxström. If you’ve ever thought of using Wikimedia’s Incubator for new projects, it’s now easier to get started. Read Robin’s wrapup post for more.
  • Platonides worked on a desktop application for mass-uploading files to Wikimedia Commons. The application will eventually make it much easier for participants in upload campaigns like Wiki Loves Monuments to upload their photos (and it’ll work on Windows, Linux, and Mac OS). I mentored Platonides, who delivered a beta version.

As further progress happens, we’ll update our page about past GSoC students. Congratulations again to the students and their mentors. And thanks to volunteer Greg Varnum, who helped me administer this year’s GSoC, and to all the staffers and volunteers who helped students learn our ways.

Sumana Harihareswara, Engineering Community Manager

Meet the Analytics Team

Over the past few months, the Wikimedia Foundation has been gearing up a variety of new initiatives, and measuring success has been on our minds. It should come as no surprise that we’ve been building an Analytics Team at the same time. We are excited to finally introduce ourselves and talk about our plans.

The team is currently a pair of awesome engineers, David Schoonover and Andrew Otto, a veteran data-analyst, Erik Zachte, and one humble product manager, Diederik van Liere. (We happen to be looking for a JavaScript engineer — if beautiful, data-driven client apps are your thing – or you know someone, drop us a line!)

We’ve got quite a few projects under way (and many more ideas), and we’d like to briefly go over them — expect more posts in the future with deeper details on each.

First up: a revamp of the Wikimedia Report Card. This dashboard gives an overview of key metrics representing the health and success of the movement: pageviews, unique visitors, number of active editors, and the like.

Illustration of the revamped Reportcard

The new report card is powered by Limn, a pure JavaScript GUI visualization toolkit we wrote. We wanted non-technical community members to be able to interact with the data directly, visualizing and exploring it themselves, rather than relying on us or analysts to give them a porthole into the deep. As a drop-in component, we hope it will contribute to democratizing data analysis (though we plan to use it extensively across projects ourselves). So play around with the report card data, or fork the project on GitHub!

Kraken: A Data Services Platform

But we have bigger plans. Epic plans. Mythical plans. A generic computational cluster for data analytics, which we affectionately call Kraken: a unified platform to aggregate, store, analyze, and query all incoming data of interest to the community, built so as to keep pace with our movement’s ample motivation and energy.

How many Android users are there in India that visit more than ten times per month? Is there a significant difference in the popularity of mobile OS’s between large cities and rural areas of India? Do Portuguese and Brazilian readers favour different content categories? How often are GLAM pictures displayed off-site, outside of Wikipedia (and where)?

As it stands, answering any of these questions is, at best, tedious and hard. Usually, it’s impossible. The size of the success of Wikimedia projects is a double-edged sword, in that it makes even modest data analysis a significant task. This is something we aim to fix with Kraken.

More urgently, however, we don’t presently have infrastructure to do A/B testing, measure the impact of outreach projects, or give editors insight into the readers they reach with their contributions. From this view, the platform is a robust, unified toolkit for exploring these data streams, as well as a means of providing everyone with better information for evaluating the success of features large and small.

This points toward our overarching vision. Long-term, we aim to give the Wikimedia movement a true data services platform: a cluster capable of providing realtime insight into community activity and a new view of humanity’s knowledge to power applications, mash up into websites, and stream to devices.

Dream big!

Privacy: Counting not Tracking

The Kraken is a mythical Nordic monster with many tentacles, much like any analytics system: analytics touches everything — from instrumenting mobile apps to new user conversion analysis to counting parser cache lookups — and it needs a big gaping maw to keep up with all the data coming in. Unfortunately, history teaches us that mythical cephalopods aren’t terribly good at privacy. We aim to change that.

We’ve always had a strong commitment to privacy. Everything we store is covered by the Foundation’s privacy policy. Nothing we’re talking about here changes those promises. Kraken will be used to count stuff, not to track user behaviour. But in order to count, we need to store and we want you all to have a good idea of what we’re collecting and why we’re collecting it and we will be specific and transparent about that. We aim to be able to answer a multitude of questions using different data sources. Counts of visitors, page and image views, search queries and number of edits and new user registrations are just a few of the data streams currently planned; each will be annotated with metadata to make it easier to query. To take a few more examples: page views will be tagged to indicate which come from bots. Traffic from mobile phones will be tagged as mobile. By counting these different types of events and adding these kind of meta tags, we will be able to better measure our progress towards the Strategic Plan.

We’ll be talking a lot more about the technical details of the system we’re building, so check back in case you’re interested or reach out to us if you want to provide feedback about how to best use the data to answer lots of interesting questions while still preserving users’ privacy. This post only scratches the surface, but we’ve got lots more to discuss.

Talk to Us!

Sound exciting? Have questions, ideas, or suggestions? Well then! Consider joining the Analytics mailing list or #wikimedia-analytics on Freenode (IRC). And of course you’re also very welcome to send me email directly.

Excited, and have engineering chops? Well then! We’re looking for a stellar engineer to help build a fast, intuitive, and beautiful toolkit for visualizing and understanding all this data. Check out the Javascript/UI Engineer job posting to learn more.

We’re definitely excited about where things are going, and we are looking forward to keeping you all up to speed on all our new developments.

Finally, we are hosting our first Analytics IRC office hours! Join us on July 30th, at 12pm PDT (3pm EDT / 9pm CEST) in #wikimedia-analytics to ask all your analytics and statistics related questions about Wikipedia and the other Wikimedia projects.

Best regards,

David Schoonover, Analytics Engineer
Andrew Otto, Analytics Engineer
Erik Zachte, Data Analyst
Diederik van Liere, Product Manager

Analyzing Mobile Browser Energy Consumption

Recently, technology reporter Jacob Aron wrote a blog post on newscientist.com that talks about how bloated website code drains your smartphone’s battery.

He mentions how Stanford computer scientist Narendran Thiagarajan and colleagues used an Android phone hooked up to a multimeter to measure the energy used in downloading and rendering popular websites. Using their experimental setup they measured the energy needed to render popular web sites as well as the energy needed to render individual web elements such as images, Javascript, and Cascading Style Sheets (CSS). They claim that complex Javascript and CSS can be as expensive to render as images. Moreover, dynamic Javascript requests (in the form of XMLHttpRequest) can greatly increase the cost of rendering the page, since it prevents the page contents from being cached. Finally, they show that on the Android browser, rendering JPEG images is considerably cheaper than other formats, such as GIF and PNG for comparably sized images.

One example that is cited is that simply loading the mobile version of Wikipedia over a 3G connection consumed just over 1 per cent of the phone’s battery, while browsing apple.com, which does not have a mobile version, used 1.4 per cent.
Yet, in the summary of the paper they find that the results from this study are not meaningful except for the initial loading of just a single page resource. It would be interesting to extend these results in a meaningful way, and study the energy signature of an entire browsing session at a site such as Wikipedia, where a user typically moves from page to page. So, during that session, downloaded web elements such as Javascript, CSS and images would mostly be cached locally. Therefore, we really can’t estimate the energy cost of a total session by simply summing the energy usage of pages visited during that session. Measuring an entire typical session may help optimize the power signature of the entire site. Custom CSS that is applicable to every page of a site would easily outweigh the cost of the apparently excessive CSS download for the render of just the first page.
So, one of the ways that we are looking to improve our mobile browser energy consumption is by implementing the MediaWiki ResourceLoader in order to improve the load times for JavaScript and CSS. ResourceLoader is the delivery system in MediaWiki for the optimized loading and managing of modules. Its purpose is to improve MediaWiki’s front-end performance and the experience by making use of strong caching while still allowing near-instant deployment of new code that all clients start using within 5 minutes. Modules are built of JavaScript, CSS and interface messages; it was first released in MediaWiki 1.17.
On Wikimedia wikis, every page view includes hundreds of kilobytes of JavaScript. In many cases, some or all of this code goes unused due to browser support or because users do not make use of the features on the page. In these cases, bandwidth and loading time spent on downloading, parsing and executing JavaScript code are wasted. This is especially true when users visit MediaWiki sites using older browsers, like Internet Explorer 6, where almost all features are unsupported, and parsing and executing JavaScript is extremely slow.
ResourceLoader solves this problem by loading resources on demand and only for browsers that can run them. Although there is too much to summarize in a simple list, the major improvements for client-side performance are gained by:
  • Minifying and concatenating
  • → which reduces the code’s size and parsing/download time
  • JavaScript files, CSS files and interface messages are loaded in a single special formatted “ResourceLoader Implement” server response.
  • Batch loading
  • → which reduces the number of requests made
  • The server response for module loading supports loading multiple modules so that a single response contains multiple ResourceLoader Implements, which in itself contain the minified and concatenated result of multiple javascript/css files.
  • Data URIs embedding
  • → which further reduces the number of requests, response time and bandwidth
  • Optionally images referenced in stylesheets can be embedded as data URIs. Together with the gzippping of the server response, those embedded images, together, function as a “super sprite”.

Patrick Reilly, Senior Software Developer, Mobile

Wikimedia Foundation selects nine students for summer software projects

We received 63 proposals for this year’s Google Summer of Code, and several mentors put many hours into evaluating project ideas, discussing them with applicants and making the tough decisions. We’re happy to announce our final choices, the Google Summer of Code students for 2012:

MediaWiki logo

All nine of these students are working on MediaWiki, the software that powers Wikimedia sites.

Congratulations to this year’s students, and thanks to all the applicants, as well as MediaWiki’s many mentors, developers who evaluated applications, and Google’s Open Source Programs Office. The accepted students now have a month to ramp up on MediaWiki’s processes and get to know their mentors (the Community Bonding Period) and will start coding their summer projects on or before May 21st. As the organizational administrator for MediaWiki’s GSoC participation, I’ll be keeping an eye on all nine students and helping them out.

Good luck!

Sumana Harihareswara, Volunteer Development Coordinator

Google Summer of Code 2012

Google Summer of Code 2012

Project ideas, students, and mentors wanted to improve Wikimedia tech this summer

Google Summer of Code 2012

Google Summer of Code 2012

For the seventh year in a row, Wikimedia Foundation is participating in the Google Summer of Code program. Google Summer of Code (GSoC) is a program where Google pays summer students USD 5000 each to code for open source projects for three months (read more).

We hope 2012′s students will develop useful chunks of MediaWiki, help us get their code shipped, and fall in love with our community such that they stay with us for years to come.

This year’s project ideas include improvements to CentralNotice, taxobox editing, search, translation tools, and more.  Interested?

University, community college, and graduate students around the world are eligible to apply to Google Summer of Code. You don’t need to be a computer science or IT major, and you can work from home.

MediaWiki logo

MediaWiki is the Wikimedia Foundation's key open source project, powering Wikipedia and our other sites.

We are looking for students who already know some PHP. We also strongly prefer for you to have some experience working with Linux, Apache, and MySQL environments, and with the Git version control system. If you haven’t contributed to MediaWiki before, How to become a MediaWiki hacker is a good place to start; we will strongly prefer candidates who submit patches before the April 6th GSoC application deadline.

If you’d like to participate, check out the timeline. Make sure you are available full-time from 21 May till 20 August 2012, and have a little free time from 23 April till 20 May for ramp-up. Please read our wiki page and start talking with us on IRC in #mediawiki on Freenode about a possible project.  Then you’ll write a proposal and submit it via the official GSoC website. The deadline for you to submit a project proposal is April 6th, but we encourage you to start early and talk with us about your idea first.

We’re also seeking experienced MediaWiki developers anywhere in the world to help select and mentor student projects. We’ll take you even if you live in the southern hemisphere and it’s not summer for you. :-) You’ll need to be available online consistently so you can respond to student questions between now and late August. As Brion Vibber put it, if you “are knowledgeable about MediaWiki — not necessarily knowing every piece of it, but knowing where to look so you can help the students help themselves” then please consider helping out.

I’m administering our participation in GSoC. So I am encouraging students to apply, getting project ideas, and managing the application process overall. I look forward to seeing students discover the joy of collaborative work that improves the Wikimedia experience for millions of users. Help us spread the word.

Sumana Harihareswara
Volunteer Development Coordinator, Wikimedia Foundation
MediaWiki Coordinator, GSoC 2012

Wikimedia engineering moving from Subversion to Git

Hello, MediaWiki developers and users! You may already be aware of this: our community is embarking on a journey to leave Subversion behind and migrate to Git for our source code repositories, starting on March 3rd (update as of February 29th: moved to March 21st). This is not an easy task. Here I’ll outline our rationale for this move, as well as our planned process.

What is Git?

Git is a distributed version control system originally developed by Linus Torvalds and others to manage the Linux kernel. In the past couple of years, it has taken off as a very robust and well-supported code repository. “Distributed” means that there is no central copy of the repository. With Subversion, Wikimedia’s servers host the repository and users commit their changes to it. In contrast, with Git, once you’ve cloned the repository, you have a fully functioning copy of the source code, with all the branches and tagged releases at your disposal.

Why switch?

Three major reasons:

To encourage participation: Since Git is distributed, it allows people to contribute with a much lower barrier to entry. Anyone will be able to clone the repository and make their own changes to keep track of them. And if you’ve got an account in our code review tool (Gerrit), you’ll be able to push changes for the wider community to review.

To fix our technical process: Subversion has technical flaws that make life difficult for developers. Notably, the implementation of branching is not very easy to use, and makes it hard to use “feature branches”. Our community is very distributed, with many parallel efforts and needs to integrate many different feature efforts, so we’d like to use feature branches more. Git branches are very easy to work with and merge between, which should make things easier for our development community.  (Several other large projects, such as Drupal and PostgreSQL, have made the same switch for similar reasons, and we’ve done our best to learn from their experiences.)

Some quotes from our community:

“I love git just because it allows me to commit locally (and offline).” – Guillaume Paumier

“[Y]ou can create commits locally and push them to the server later (great for working without wifi), you can tell it ‘save my work so I can go do something else now’ in one command, and it’ll allow us to review changes before they go into “trunk” (master)…. without human intervention in merging things into trunk. Gerrit automates this process.” – Roan Kattouw

And finally, to get improvements to users faster: with better branching and a more granular code review workflow that suits our needs better, plus our ongoing improvements to our automated testing infrastructure, we won’t have to wait months before deploying already-written features and bugfixes to Wikimedia sites.

We had years of discussion before we finally decided to switch, but now we can look forward to more flexibility and power in our engineering processes.

What are we doing?

We’ve now done almost all the back-end work of preparing our repository for the move and are in the final steps of preparation (details). We’ve also written explanations of the new workflow, the migration schedule, issues yet to be addressed, and other related topics. Right now, we’re asking people to stop creating any new extensions in Subversion right now, and to watch the wikitech-l mailing list for more updates.

What are the next steps?

Over the next two and a half weeks, the Git repository that contains MediaWiki core and extensions will be brought in step with Subversion, and at first it will be read-only (no one will be able to push changes). This will allow developers to start cloning it to their local machines and getting used to things.

For MediaWiki core and for extensions that the Wikimedia Foundation deploys on its wikis, the switchover is pencilled in for the weekend of March 3rd (Update as of February 29th: the new migration date is Wednesday, March 21st). We’ll do core first, and then extensions after, but hopefully all in the same weekend. After the successful migration, the Subversion repository (for the directories that have moved to Git, such as /trunk/phase3/) will be made read-only.

See the full schedule.

I develop for a Wikimedia project. Do I have to switch to Git?

Only two projects are affected immediately: the core of MediaWiki and the extensions that get deployed on Wikimedia Foundation projects.

So, if you work on an extension that the Wikimedia Foundation does not use, or on a non-MediaWiki project hosted at svn.wikimedia.org, you have more time to decide. Talk it over with your community and decide whether you would like to move to Git immediately, move to Git sometime over the next several months, or move to another hosting provider sometime before mid-2013. We would like to gradually migrate all projects currently on Wikimedia’s Subversion repository so that we can make all of svn.wikimedia.org read-only by the middle of 2013, and thus only have to support one source control infrastructure.

More details.

Will training and documentation be available? When?

Yes, we will provide training and documentation to help you use the new workflow. Check our Git page and its links now, and watch that space! There will be more documentation as well as some interactive training sessions before the big switchover in early March.

If you have any questions, please ask in #mediawiki on Freenode or on wikitech-l.  Thank you!

Chad Horohoe
Git migration lead
Platform Engineering department
Wikimedia Foundation

Sumana Harihareswara
Volunteer Development Coordinator
Platform Engineering department
Wikimedia Foundation

Free software community shares lessons learned in “Open Advice” book

Open Advice book cover

The "Open Advice" book is available for free download, or purchase as print from lulu.com.

The Open Advice book, a collection of essays, stories and lessons learned by members of the Free Software community, is out!

The book was just announced at FOSDEM, the Free and Open Source Software Developers’ European Meeting, in Brussels over the week-end.

About 50 authors from many different projects of the free software community were brought together by Lydia Pintscher, the book’s editor, who started the project in early 2011.

A year and 380 pages later, the book is now available, and tries to provide an answer to the question: What’s the key thing you would have liked to know when you started contributing?

Authors answer that question for many topics, ranging from “Writing patches” to “Documentation for Novices”, to business models, conferences, translation, design, and more.

I contributed “Learn from your users”, a chapter on user experience and usability testing. You’ll also recognize other names from the Wikimedia community, like Evan Prodromou, Markus Krötzsch and Felipe Ortega.

You can learn more about the book and the authors on the book’s website.

All the content of the book is released under the same license as Wikipedia, the Creative Commons Attribution Share-Alike license.

Check it out! You can download the book for free as a PDF file, order a print from lulu.com if you prefer paper books, or fork the text on GitHub.

I hope you’ll like the book, and it’ll prove useful, whether you’re new to the world of software, or you’re a seasoned contributor already.

Guillaume Paumier
Technical Communications Manager

Announcing the WikiChallenge Winners

Wikipedia Participation Challenge

Over the past couple of months, the Wikimedia Foundation, Kaggle and ICDM organized a data competition. We asked data scientists around the world to use Wikipedia editor data and develop an algorithm that predicts the number of future edits, and in particular predicts correctly who will stop editing and who will continue to edit.

The response has been great! We had 96 teams compete, comprising in total 193 people who jointly submitted 1029 entries. You can have a look for yourself at the leaderboard.

We are very happy to announce that the brothers Ben and Fridolin Roth (team prognoZit) developed the winning algorithm. It is elegant, fast and accurate. Using Python and Octave they developed a linear regression algorithm. They used 13 features (2 are based on reverts and 11 are based on past editing behavior) to predict future editing activity. Both the source code and the wiki description of their algorithm are available. Congratulations to Ben and Fridolin!

Second place goes to Keith Herring. Submitting only 3 entries, he developed a highly accurate model, using random forests, and utilizing a total of 206 features. His model shows that a randomly selected Wikipedia editor who has been active in the past year has approximately an 85 percent probability of being inactive (no new edits) in the next 5 months. The most informative features captured both the edit timing and volume of an editor. Asked for his reasons to enter the challenge, Keith named his fascination for datasets and that

“I have a lot of respect for what Wikipedia has done for the accessibility of information. Any small contribution I can make to that cause is in my opinion time well spent.”

We also have two Honourable Mentions for participants who only used open source software. The first Honorable Mention is for Dell Zang (team zeditor) who used a machine learning technique called gradient boosting. His model mainly uses recent past editor activity.

The second Honourable Mention is for Roopesh Ranjan and Kalpit Desai (team Aardvarks). Using Python and R, they developed a random forest model as well. Their model used 113 features, mainly based on the number of reverts and past editor activity, see the wikipage describing their model.

All the documentation and source code has been made available, the main entry page is WikiChallenge on Meta.

What the four winning models have in common is that past activity and how often an editor is reverted are the strongest predictors for future editing behavior. This confirms our intuitions, but the fact that the three winning models are quite similar in terms of what data they used is a testament to the importance of these factors.

We want to congratulate all winners, as they have showed us in a quantitative way important factors in predicting editor retention. We also hope that people will continue to investigate the training dataset and keep refining their models so we get an even better understanding of the long-term dynamics of the Wikipedia community.

We are looking forward to use the algorithms of Ben & Fridolin and Keith in a production environment and particularly to see if we can forecast the cumulative number of edits.

Finally, we want to thank the Kaggle people for helping in organizing this competition and our anonymous donor who has generously donated the prizes.

Diederik van Liere
External Consultant, Wikimedia Foundation

Howie Fung
Senior Product Manager, Wikimedia Foundation

2011-10-26: Edited to correct description of the winning algorithm

Google Summer of Code students reach project milestones

Congratulations to the seven Google Summer of Code students who made it through the summer of 2011! They all accomplished a great deal, but want to continue contributing to ensure their work maximally benefits Wikimedia.

Google Summer of Code logo 2011

MediaWiki participated in Google Summer of Code 2011.

Yuvi Panda‘s assessment parsing/aggregating extension aims “to make it easier to select and export article selections for various offline collections.” Yuvi needs some code review and suggestions on how to improve it to meet the Foundation’s quality standards for deployability, as he wrote the developers’ mailing list.

Salvatore Ingala worked on making gadgets customizable. As he elaborated, that means:

  • “allowing gadgets to easily declare the list of configuration
    variables they have;
  • allowing users to easily change those settings, with an easy-to-use
    UI integrated to the Special:Preferences page.”

The next step is merging his code into trunk, which Salvatore’s planning with other MediaWiki developers.

Kevin Brown created the ArchiveLinks project to address the problem of linkrot on Wikipedia:

In articles we often cite or link to external URLs, but anything could happen to content on other sites — if they move, change, or simply vanish, the value of the citation is lost. ArchiveLinks rewrites external links in Wikipedia articles, so there is a ‘[cached]‘ link immediately afterwards which points to the web archiving service of your choice. This can even preserve the exact time that the link was added, so for sites which archive multiple versions of content (such as the Internet Archive) it will even link to a copy of the page that was made around the time the article was written.

Kevin’s next step: getting a security review of his code, getting a starter feed set up so that the Internet Archive can start archiving it, and campaigning to interest Wikimedians and thus eventually get consensus to turn it on. At least one Wikimedian has already praised Kevin for his work.

Akshay Agarwal wrote a MediaWiki extension, SignupAPI, that makes it easier for a new user to create an account. “This extension creates a special page that cleans up SpecialUserLogin from signup related stuff, adds an API for signup, adds sourcetracking for account creation & provides Ajax-ified validation for signup form.” Akshay’s waiting for code review and discussion before the project can move forward further and benefit Wikimedia users.

MediaWiki logo

Seven students contributed to various parts of MediaWiki, the wiki software that supports WMF sites.

Yuvi, Salvatore, Kevin, and Akshay all worked on features that they aim to get into Wikimedia Foundation-run wikis, such as Wikipedia, Wikisource, Wikinews, etc., sooner rather than later. In contrast, three students worked on extensions that will primarily benefit the larger MediaWiki community. For example, Yevhenii Vlasenko‘s project was a “UserStatus” feature for SocialProfile. The SocialProfile extension is not currently deployed on any WMF wikis, but will benefit several other MediaWiki administrators and users. Zhenya finished his work but would like to continue by integrating better with social networks.

And two students worked on Semantic MediaWiki, which is also not currently deployed on any Wikimedia Foundation sites. Devayon Das made a “QueryCreator” and other improvements, and hopes to simplify its layout, make its interface easier to use, and add some features. And Ankit Garg worked on “Semantic Schemas”.

Congratulations to the students and their mentors.  Here’s hoping they’re all here to help out when next year’s interns roll in! :-)  And I’m looking forward to meeting Kevin and Salvatore, and introducing them to other Wikimedia & MediaWiki developers, at the New Orleans developers’ meetup next month.

Sumana Harihareswara
Volunteer Development Coordinator
Wikimedia Foundation