Wikimedia blog

News from inside the Wikimedia Foundation

Archive for November, 2011

Helping new editors by responding to their feedback

Recently the tech team at WMF deployed a couple of experimental tools for gathering feedback on the experiences of new editors. With MoodBar, new editors can quickly and easily provide feedback on their editing experience by entering a 140-character comment. All these comments are posted as a feed on the Feedback Dashboard. Since the launch, over 8,500 pieces of feedback have been submitted by thousands of users. You can watch the stats roll in in real time on this report.

The Feedback Dashboard shows data provided by new editors through the MoodBar feature.

Today we are introducing new functionality that will enable experienced editors to easily respond to this feedback. Experienced editors who want to help new editors through their initial few edits may now respond in-line without leaving the dashboard. The new editor (who left the initial comment about being happy, sad, or confused) will then receive the reply on their talk page. This feature will make it easier for experienced users to lend a helping hand to new users, guiding them through their initial experiences editing Wikipedia.

It is now possible for more experienced users to respond to MoodBar messages directly from the Feedback Dashboard.

Steven Walling and Maryana Pinchuk have also started a Response Team of experienced editors willing to help out. So far, over 30 editors have joined. If you’re an editor, please consider helping out by signing up on the Response Team page.

Steven and Maryana will also be holding “office hours” this Sunday, December 4th, for anyone interested in learning more about how to respond to new editor feedback and in discussing the feature. If you’re interested, please attend!

Howie Fung and Brandon Harris
Wikimedia Foundation Tech Team

Wikimedia Research Newsletter, November 2011

Vol: 1 • Issue: 5 • November 2011

Quantifying quality collaboration patterns, systemic bias, POV pushing, the impact of news events, and editors’ reputation

With contributions by: Tbayer, Hfordsa, DarTar and Romanesco

Collaboration pattern analysis: Editor experience more important than “many eyes”

One of the motifs indicating article quality: One editor (top) having worked on several related articles (bottom)

A paper titled “Characterizing Wikipedia Pages Using Edit Network Motif Profiles”[1] by three researchers from University College Dublin indicates that the quality of a Wikipedia article can be predicted from characteristics of its “edit network” – a graph derived from the collaboration of Wikipedians in that area. Network motifs are small graphs which occur particularly frequently as sub-graphs of networks of a certain kind, and can be regarded as their building blocks in some sense. (The concept is popular in bioinformatics, where it is applied to gene regulatory networks.) In this paper, the authors use graphs with at most five nodes consisting of users and articles, which are connected by an edge if the user has edited the article – giving 17 possible “Wikipedia network motifs”. (Anonymous users are disregarded.) For a Wikipedia article, the researchers form an “ego network” consisting of that article, articles which link to it (and have been edited by at least one of the users who edited the core article), and the users who edited them. For a sample of around 2000 articles from the History and United States categories, the frequencies of the 17 “Wikipedia network motifs” in those articles’ “ego networks” were calculated.

Using machine learning techniques, the researchers are able to discern with some certainty articles of basic quality (defined as having been assessed as Start class by Wikipedians) from those of good quality (defined as Featured or B class), solely based on this set of motif frequencies in the article’s edit network. Looking at the impact of each of the 17 types separately, they found that “all network motifs have some potential to discriminate between good and basic Wikipedia articles” in the sample, but that among the four best predicting motifs, three are “stars with editors at their centre”:

“This is interesting because it shows that many eyes is not really the defining characteristic of quality; instead experience is important – the editors should have worked on many other articles.”
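
To make the idea of an “ego network” and an editor-centred star more concrete, here is a minimal sketch in Python. It is not the authors' code: the function names, the dictionary layout and the toy data are all assumptions made for illustration, and it counts every star with an editor at the centre rather than reproducing the paper's 17 motif types.

```python
from math import comb


def ego_network(core, editors_of, links_to):
    """Build the ego network of `core` as described in the text.

    editors_of: dict mapping article -> set of registered editors
    links_to:   dict mapping article -> articles that link to it
    Returns the set of articles and the set of (editor, article) edges.
    """
    core_editors = editors_of[core]
    articles = {core}
    for other in links_to.get(core, ()):
        # Keep a linking article only if it shares an editor with the core.
        if editors_of.get(other, set()) & core_editors:
            articles.add(other)
    edges = {(user, art) for art in articles for user in editors_of[art]}
    return articles, edges


def editor_star_count(edges, leaves):
    """Count stars with one editor at the centre and `leaves` articles."""
    articles_by_editor = {}
    for user, art in edges:
        articles_by_editor.setdefault(user, set()).add(art)
    # An editor with n articles is the centre of C(n, leaves) such stars.
    return sum(comb(len(arts), leaves) for arts in articles_by_editor.values())


# Toy data (hypothetical): B, C and D all link to A.
editors_of = {"A": {"u1", "u2"}, "B": {"u1"}, "C": {"u3"}, "D": {"u1", "u2"}}
links_to = {"A": ["B", "C", "D"]}

articles, edges = ego_network("A", editors_of, links_to)
print(sorted(articles))              # ['A', 'B', 'D']
print(editor_star_count(edges, 2))   # 4
```

Article C is left out because it shares no editor with A. A classifier like the one described above would take such counts, one per motif type, as its input features.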

(more…)

Supporting the languages of India

India is different. Given that India is very strategic for the Wikimedia Foundation, the question is what we can do to raise the profile of our projects and to support the Indic languages effectively.

Many well-educated people, including people with a university-level education, are effectively illiterate in their own language. A Wikipedia in their own language does not tempt them to get involved, because they lack the skills, even though it would not be that hard for them to learn to read and write their mother tongue. Two things make writing the Indic languages easier: the scripts are highly phonetic, and InScript, the dominant keyboard layout for Indic languages, ensures that the same sound is always in the same place.

When our goal is to get more people involved in the Indic languages, we can ask people to transcribe the scans of public domain books. We will provide them with a keyboard mapping and with fonts that show their language. As these “illiterates” recognise the characters and reproduce them digitally, they learn not only to type their language; they may even learn to read it. When we recognise their effort in a thank-you note accompanying the book, experience teaches us that they are likely to help us in future projects.

The project that is already making a big impact in India in this way is the Malayalam Wikisource project. They published a CD with a year’s worth of sources and distributed it to the schools of Kerala. They produce software that ensures that the content looks really good. The software as well as the content is available on the internet, but sadly this full experience cannot be had on Wikisource itself.

When a new book becomes available, the Malayalam press often mentions it in its periodicals, so much so that Wikisource is mentioned more often in the press than Wikipedia.

Similar projects for other Indic languages have been a popular topic at the WikiConference India; it was discussed at least for Sanskrit and Tamil. The discussion was not only about the organisation of such a project but also about internationalising the software that prepares the final product and about using Kiwix for presenting it. When you consider how much literature is available in the Indic languages that is already in the public domain, this is a project that will run and run.

Preparing sources in Wikibooks or Wikisource in a collaborative way makes sense in a wiki. Once the work is done, however, the content can be published in all kinds of formats. This is important because we want it to be read as widely as possible; that is how we best realise our objectives.

Jimmy was right when he said in his speech that the Indic language communities can learn from each other and do really well. However, these best practices can be applied to any Wikisource or Wikibooks project.

Thanks,
Gerard Meijssen
Internationalization / Localization outreach consultant

American Sociological Association launches Wikipedia Initiative

The American Sociological Association (ASA) has started a “Wikipedia Initiative”, inviting its members to enhance Wikipedia articles, and to incorporate Wikipedia editing into their classroom teaching, in collaboration with the Wikimedia Foundation’s higher education program. In a feature article for the scholarly society’s monthly newsletter, ASA president Erik Olin Wright described the initiative as a “call to duty”.


Follow this series of brief news on enhancing Wikipedia participation via RSS, on Tumblr, or on Wikipedia.

Do It Yourself Analytics with Wikipedia

As you probably know, we regularly publish backups of the different Wikimedia projects, containing their complete editing history. As time progresses, these backups grow larger and larger and become increasingly hard to analyze. To help the community, researchers and other interested people, we have developed a number of analytics tools to assist in analyzing these large datasets. Today, we want to update you on these new tools, what they do and where you can find them. Please remember that they are all still in development:

  • Wikihadoop
  • DiffDB
  • WikiPride

Wikihadoop

Wikihadoop makes it possible to run MapReduce jobs with Hadoop on the compressed XML dump files. This means that we can parallelize the processing of our XML files with embarrassing ease, so we no longer have to wait days or weeks for a job to finish.

We used Wikihadoop to create the diffs for all edits from the English XML dump that was generated in April of this year.
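
For readers new to MapReduce, the pattern can be sketched in a few lines of plain Python. This is only an illustration of the map and reduce steps, not Wikihadoop's interface and not a Hadoop job: the record layout (a page title with the previous and current revision text) and all function names are assumptions, and `run_job` runs everything in one process where Hadoop would spread the work over many machines.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def map_revision(record):
    """Map step: emit (page title, characters added by this revision)."""
    matcher = SequenceMatcher(None, record["previous"], record["current"])
    added = sum(
        j2 - j1
        for tag, _i1, _i2, j1, j2 in matcher.get_opcodes()
        if tag in ("insert", "replace")
    )
    yield record["page"], added


def reduce_sum(page, values):
    """Reduce step: add up all values emitted for one page."""
    return page, sum(values)


def run_job(records, mapper, reducer):
    """Group mapper output by key, then reduce each group."""
    grouped = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            grouped[key].append(value)
    return dict(reducer(key, values) for key, values in grouped.items())


# Hypothetical records, each pairing a revision with its predecessor.
records = [
    {"page": "X", "previous": "", "current": "abc"},
    {"page": "X", "previous": "abc", "current": "abc def"},
    {"page": "Y", "previous": "Hello", "current": "Hello world"},
]
print(run_job(records, map_revision, reduce_sum))  # {'X': 7, 'Y': 6}
```

Because each record is mapped independently of the others, the map step can be split across as many workers as are available, which is what makes the problem easy to parallelize.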

DiffDB

DiffIndexer and DiffSearcher are the two components of the DiffDB. The DiffIndexer takes as raw input the diffs generated by Wikihadoop and creates a Lucene-based index. The DiffSearcher allows you to query the index so you can answer questions such as:

  • Who has added template X in the last month?
  • Who added more than 2000 characters to user talk pages in 2008?
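
The second question above can be sketched as a filter and a sum over diff records. The sketch below uses plain Python rather than a Lucene index, so it does not show how DiffSearcher is actually queried; the `Diff` fields and the function name are hypothetical, and it reads “more than 2000 characters” as a yearly total per user.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Diff:
    """One edit, reduced to the fields this query needs (assumed layout)."""
    user: str
    namespace: str
    year: int
    chars_added: int


def heavy_contributors(diffs, namespace, year, threshold):
    """Return users whose added characters in `namespace` during `year`
    sum to more than `threshold`, sorted by name."""
    totals = Counter()
    for diff in diffs:
        if diff.namespace == namespace and diff.year == year:
            totals[diff.user] += diff.chars_added
    return sorted(user for user, total in totals.items() if total > threshold)


# Hypothetical sample data.
diffs = [
    Diff("Alice", "User talk", 2008, 1500),
    Diff("Alice", "User talk", 2008, 800),
    Diff("Bob", "User talk", 2008, 1900),
    Diff("Carol", "User talk", 2009, 5000),
    Diff("Dave", "Talk", 2008, 9000),
]
print(heavy_contributors(diffs, "User talk", 2008, 2000))  # ['Alice']
```

Scanning every record like this would be slow on the full edit history, which is the reason for building an index first.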

WikiPride

Volume of contributions by registered users on the English Wikipedia until December 2010, colored by account age

Finally, WikiPride allows you to visualize the breakdown of a Wikipedia community by age of account and by the volume of contributed content. You need a Toolserver account to run this, but you will be able to generate cool charts.

If you are having trouble getting Wikihadoop to run, please contact me at dvanliere at wikimedia dot org and I will be happy to point you in the right direction! Let the data crunching begin!

Diederik van Liere, Analytics Team

You have new messages: improving communication on Wikipedia

Every month, hundreds of thousands of people press the edit button on Wikipedia for the very first time. And for many of these new users, the first (and sometimes only) message that appears on their user talk page is a template rather than a human response. This is especially true on our larger, older projects.

User talk page templates were developed by the community because of the tremendous volume of contributions that began pouring in as Wikipedia grew more and more popular. Today, with the focus of our movement shifting to openness and attracting new editors, it’s time to rethink the message we’re sending via templates.

That’s why Steven Walling and I have started a project to A/B test many of the template messages received by new users, such as warnings and deletion notices. In collaboration with over 20 members of the Wikimedia community, from the English and Portuguese Wikipedias so far, we’ve designed a number of experiments that will give us tangible data to improve communication on the projects.

How it works

With the help of tools developed by our summer researchers, different messages we want to test are randomly delivered to different groups of users. Tracking the data from these two groups, we can assess the efficacy of different kinds of messages, based on whether users continue to edit constructively after receiving them.
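
The mechanics of such a test can be sketched as follows. This is not the tooling built by the summer researchers; the function names, the variant labels and the idea of measuring retention as “still editing later” are assumptions made for illustration.

```python
import random


def assign_groups(users, variants, seed=None):
    """Randomly pick one template variant for each user."""
    rng = random.Random(seed)
    return {user: rng.choice(variants) for user in users}


def retention_by_variant(assignment, still_editing):
    """Share of users per variant who kept editing after the message."""
    received = {}
    retained = {}
    for user, variant in assignment.items():
        received[variant] = received.get(variant, 0) + 1
        if user in still_editing:
            retained[variant] = retained.get(variant, 0) + 1
    return {v: retained.get(v, 0) / n for v, n in received.items()}


# Hypothetical example with a fixed assignment so the result is easy to check.
assignment = {"a": "default", "b": "default", "c": "personalized"}
print(retention_by_variant(assignment, still_editing={"a", "c"}))
# {'default': 0.5, 'personalized': 1.0}
```

A real experiment would use far larger groups and would check whether the difference between the rates is statistically significant before drawing conclusions.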

Our working hypothesis, which we are continuing to test and refine, is that making templates more personal will help retain the good-faith editors who receive them, while continuing to deter vandals, spammers, and other bad-faith editors. For both groups, showing them that the encyclopedia is built through the hard work of other people like them is key.

What you can do

There are thousands of different user talk page templates on Wikimedia projects. We need your help to construct and carry out more tests, especially in non-English communities!

Please visit our task force page on English Wikipedia or our interlanguage hub on Meta and sign up. You can add your project to the list if you’re interested in starting new tests.

This is the first time that the Wikimedia Foundation has devoted resources to helping test and improve the template infrastructure the community uses every day to function. We hope that together, we can significantly improve the way Wikimedia projects communicate with editors.

Thank you,
Maryana Pinchuk and Steven Walling

The Mumbai hackathon was sweet

When a hackathon is organised, it is wonderful when the reality of the results exceeds expectations. The reality was that some of India’s best and brightest attended the hackathon. They represented many of the languages of India, and it showed.

Seven Indians and a German created an input method for their language. A Russian keyboard method is promised for the next day. There was a jQuery wizard who created a wonderful and necessary addition to the Narayam extension: a visual cue to where the characters are on the keyboard. This information comes directly from the Narayam definitions and the best part is that the visual cue actually works as well.

The WebFonts extension got its reality check. WebFonts provides default fonts in order to ensure that nobody sees the infamous Unicode squares and numbers instead of the desired characters. The MediaWiki software is exclusively open source, and consequently the fonts we deliver through the WebFonts extension need to be freely licensed, too. The default font we use for the Indic languages is the Lohit font produced by Red Hat. It was quite astonishing to learn that some of its glyphs do not look the way the characters should. Bugs have been filed for this at Red Hat and more work will be done.

We are going to roll out the WebFonts extension on December 12th. Our aim is to install it on the Indic projects. When we have freely licensed fonts that show languages correctly, we will finally be able to provide readable content to everyone. We will be working towards resolving the issues identified at the hackathon.

The Mumbai hackathon has also been good for the Kiwix off-line reader; not only was the software localised into several languages, new developers also familiarized themselves with the software itself to implement further improvements. This is quite important because many Indian people have no or intermittent access to the Internet. In addition to Wikipedia content, there are many projects in India to transcribe books that are in the public domain; as the Kiwix software gets ready to support this content, it will help more and more people get access to India’s rich cultural heritage.

Mobile support was the third centre of gravity; many first-time Wikimedia hackers teamed up with seasoned Wikimedia developers and this produced great results. This included work on a mobile landing page for India, as well as a gateway that allows users to receive Wikipedia articles over SMS and the carrier-specific USSD technology. To appreciate this, consider that many people do not have access to the Internet and consequently to our content. Work also continued on the “Wikipedia Zero” project, which aims to bring Wikipedia and other Wikimedia content to millions of users without data charges.

We also saw an interesting connection with the October 2011 Coding Challenge. Developer Yuvipanda implemented Android 2.2 support for one of the coding challenge submissions, the “Share with Wikimedia Commons” Android app (as well as for the official Wikipedia Android app).

All this will get some review and maybe some polishing, but we are quite eager to bring this functionality to you.

Many of the hackers were new to MediaWiki. With an introduction by Erik and private tutoring by Sumana, Tomasz, Patrick, and others, several people really got into the swing of things, to the extent that some bugs were smashed. The hackathon proved, as always, that when you bring great people together, special things can and do happen.

Thanks,
Gerard Meijssen
Internationalization / Localization outreach consultant

Hackathon Mumbai has started

Concurrent with the WikiConference India, a hackathon has been organised. Many Wikimedia developers are present at the Mumbai hackathon, but there are many, many more Indian developers. The one thing that is quite funny is that when you ask them “what language do you speak”, they say that it is English. Only when you ask “do you speak any other languages?” do you learn “eh, Hindi, Marathi, Tamil..”

Obviously a hackathon is not only for language support, far from it. Still, there will be a lot of development on the things that tie in with the functionality developed by the Localisation team for MediaWiki, like input methods, web fonts and maybe even transliteration between the scripts used by languages like Konkani or Panjabi.

Hackathons are powerful; they help raise awareness that there is not only an “edit button” but that you can also work on the code and help determine what MediaWiki and consequently Wikipedia may be.

Thanks,

Gerard Meijssen
Internationalization / Localization outreach consultant

Nobody notices when it’s not broken: New database servers deployed

The Technical Operations team has just completed behind-the-scenes work that will likely never be noticed by our readers.

Our External Storage databases hold the text for every version of every wiki page; they have slowly grown over the life of Wikipedia and its sister projects. Ten years is a lifetime on the Internet, and the incremental changes that were made to our external storage system over that period, though appropriate at the time they were made, resulted in a setup that was a challenge to maintain and which was becoming unreliable.

Graph of query duration

We spent a few weeks analyzing all the various servers across which the page text data was spread, in order to gather it all together onto a single host. From there, it could be replicated onto newer, more reliable and higher performance hardware. Along the way, we found and fixed a number of inconsistencies to make the dataset more regular.

The deployment of the new hardware lasted a few days (as we moved things piece by piece) and was finished this past Monday with no fanfare. There was a brief (about 10-minute) period during which articles could not be edited while we switched writes to the new hardware. The end result is a barrage of small improvements, all of which together make for a happy TechOps team:

  • average query duration has dropped from about 15ms to around 8ms and the worst case from 576ms down to 60ms;
  • replication and failover processes are now well known and standardized;
  • total hardware used has dropped from around 30 servers to 8, now in two locations;
  • hosts no longer double up as web servers and database servers for text; dedicated servers are used for the database.

It’s a small victory in the battle against entropy, but an important prerequisite for carrying out our mission of providing unfettered and reliable access to the sum of all knowledge.

Ben Hartshorne, Operations Engineer

Wikipedia Education Program by the numbers

The Wikipedia Education Program has grown by leaps and bounds since its inception last year, as part of the Public Policy Initiative. In 2011, the program ventured beyond the United States into Canada and India, making the measurements of the program’s impact even more important. We want to use these metrics (some of which are outlined below) as tools that help us understand and improve the Wikipedia Education Program as a whole, while also understanding individual pieces of the system better.

a. Fall 2011 Numbers and Growth

b. Gender Representation

c. Wikipedia Education Program Metrics and Activities Meeting

(more…)