Wikimedia blog

News from the Wikimedia Foundation and about the Wikimedia movement

Posts by Gerard Meijssen

Lua previewed

The Berlin hackathon 2012 brought together a record number of people, who worked on many technical issues. Some people came to learn about MediaWiki, some came to learn about the finer points of Git and Gerrit. The great thing about MediaWiki hackathons is that there is typically a great mix of knowledgeable people, talented people and people who can explain and help with difficult technical issues. It is also where new technologies are previewed; this time it was Lua that got a lot of the limelight.
It is a pleasure to share with you what theDJ has to say in answer to questions about the hackathon and Lua.
What is the attraction of a hackathon, and what was special about Berlin 2012?

For me as a volunteer the benefit of such an event is twofold. The first part is of course getting to know the people that you usually only interact with online. It’s just more fun and the connections you build are simply stronger. It often also helps you in your future online communications with these people. When you know people in person you also tend to communicate better online.

The other reason is that it is a great way to learn, brainstorm, prototype rapidly and get questions asked and answered efficiently. Nothing beats being in the same room when discussing or working on a topic.

There were several themes in the presentations and workshops … you chose Lua. What is Lua and what is its relevance?

The complexity of pages is actually one of our biggest performance issues right now and the [[en:Barack Obama]] page is a well known example of that. After an edit of that page it often takes well over 20 seconds for the server to render the page again. This is creating a huge resource load on the server and it is confusing the editors because it seems like the server is not responding to their edits.

The complexity is caused by two things you can use in pages: templates and parser functions. The performance of these elements is shaky, in large part because our inventive MediaWiki users have found ingenious yet complex ways of working around the limited functionality these two elements provide.

Ideally much of the functionality would be converted into PHP MediaWiki extensions, but that development path is much slower and less accessible for MediaWiki users. For years there have been discussions in the developer community on how to tackle this problem, but a clearer consensus is starting to form now. The idea is to move away from the old combination of templates and parser functions and replace much of it with code written in a new language, Lua, which is still accessible for users, much more capable than templates and parser functions, yet much easier than PHP extensions.

Overall Lua holds the promise of much higher performance and flexibility compared to templates and parser functions, yet it will allow us to have the same type of safeguarding on the server side that is so important for a major website like Wikipedia.

If Lua is scheduled for 2013, why all this attention now?

Exactly because it is not deployed yet. Right now we can still make significant changes easily without causing too much trouble for users. But to know what changes are needed, you do need to use the system and learn from that usage. By engaging the developer community to experiment with writing templates and converting templates, we can find issues that are still outstanding or that were simply never anticipated when implementing the system, before it goes into wider deployment.

Simply said, it is because the existing templates and parser functions that are in use right now on all these different MediaWiki wikis are so complicated. It will take years to replace all the code, so in order to reap the benefits as soon as possible, you will want to tackle the most complex and worst-performing code early on in the conversion.

You have been converting the “coordinates” template. What is its attraction?

The “Coord” template is a real-life example of a template with high complexity that is used on tens of thousands of pages. It is exactly the type that in theory should benefit greatly from conversion to Lua. At the same time it is still ‘small’ enough to actually get done within a reasonable amount of time. The process of converting it instead of writing something from ‘scratch’ will likely mimic the way users will start when using the new Lua capabilities and was therefore important to test.

I have currently spent about 9 hours on it, and am probably about halfway through the conversion. After doing a full conversion I would like to benchmark the difference between the two implementations so we can further validate our suspicions about the real-world benefits of this new Lua method. A partial conversion of the template seems to have already sped it up by at least 4x in my preliminary assessments.

How will this functionality become available on the other 270+ Wikipedias?

Lua is now available on Wikimedia Labs for testing and this will be followed by gradually adding mediawiki.org and other ‘low priority’ production sites. There are still major parts of the extension that require attention before it is ready for a general release.

In terms of the scripts themselves, the users will probably start with the most resource ‘expensive’ templates on English Wikipedia and slowly work their way through, at every step trying to keep everything as compatible with the old systems as needed.

Should we not implement the lessons of “Gadgets 2.0” and share them from a central site?

I think having a centralized Lua module repository, similar to the central Gadget repository for JavaScript that we will soon have, is something we should definitely consider. Past experience with scripts developed by users has taught us that it is a maintenance hell, because people fork and adapt the code for every single wiki. Though most of those copies are 95% the same code, they are not actually the same script, and if you want to change something in them, you either need to go through 270 wikis or people invest valuable time in fixing a problem that someone else has already fixed at another wiki.

For the Lua modules I think it is very important to be able to share that 95% of code that will be the same on all the wikis. This is not yet possible, but it has been discussed. It is my opinion that we really need to get that working before a full deployment in 2013.

Several people were hacking Lua code, and even more people attended the workshop. What is the most relevant thing for them to do moving forward?

Provide feedback based on their experiences. As I see it, this is a learning stage and as a group we can only take all lessons into account if we share what each and every one of us has learned.

You identified two parts to converting templates to Lua, the conversion itself and optimisation. How relevant will optimisation be?

As I said earlier, the users have found ingenious but complex ways around the limitations of templates and parser functions. A conversion is about changing from one language to the other, without changing HOW the code works. This conversion will probably already provide large speed gains.

Optimizing is about getting rid of all the weird constructs that we used because we worked around the limitations of templates and parser functions. These constructs are no longer required and will actually slow down the Lua script, so you will want to remove them.
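The difference between the two steps can be sketched with a made-up example. The snippet below is written in Python rather than Lua purely for illustration, and the zero-padding task and both function names are hypothetical; it is not taken from any real template. The first function is a literal port that keeps the chain of conditions a template would have needed, the second drops the workaround in favour of the language's built-in formatting.

```python
def pad_literal(n: int) -> str:
    """Literal conversion: keeps the chained conditions of the template workaround."""
    if n < 10:
        return "00" + str(n)
    if n < 100:
        return "0" + str(n)
    return str(n)


def pad_optimised(n: int) -> str:
    """Optimised version: the workaround is replaced by built-in formatting."""
    return f"{n:03d}"


if __name__ == "__main__":
    # Both versions give the same result for non-negative integers.
    for value in (7, 42, 123):
        print(pad_literal(value), pad_optimised(value))
```

Both functions return the same strings for non-negative integers; the optimised one is simply shorter and has fewer branches to maintain.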

You use Lua in your day job. In what way is Lua for MediaWiki different from the Lua that you know?

Not so much actually. Of course there is the interface towards MediaWiki which is different from the interface that I work with (an interface to write mobile applications) but the language is exactly the same.

It could have been the first question: what benefit will Lua bring us?

It will speed up pages and also make it possible to do even more advanced templating. At the same time it will look a bit less scary to editors, and will create more readable code that is easier to maintain.

Towards a Wikipedia for signed languages

There are more than a hundred sign languages worldwide. Almost half of the people who sign can hear perfectly well and for most of them it is a second language. They and the deaf people for whom spoken languages are unheard of share unique cultures and languages.

All these cultures and languages have the same issues as oral languages; they rely on the passing of knowledge from person to person, from generation to generation. The best way to preserve the language and culture is by making more permanent records: by writing things down and by recording video.

I am happy to have interviewed Steve Slevinski; Steve has been responsible for much of the development that brings sign languages to computers and the Internet. One of the ambitions of the SignWriting community is to have their own Wikipedias. Steve is the man who is making this a reality.

Gerard Meijssen
Internationalization / Localization outreach consultant

Unicode technology enables SignWriting

(more…)

Niklas Laxström, language engineer and Wikimedian

University of Helsinki

The average age of the MediaWiki developers is quite young. They often started contributing to the MediaWiki code while still in school or university. When their contributions show promise, they are sometimes asked to contribute to particular projects. This has resulted in the hiring of students and they continue to do professionally what they at first did as a hobby.

While the Wikimedia Foundation is happy with the talent it gains in this way, it feels strongly that finishing formal education is very important. Some students only work for the WMF in their holidays while others manage regular contributions in their free time as well. Such relations are often strengthened through programs like the Google Summer of Code or through summer internships.

Niklas Laxström recently finished university and this happy occasion is reason enough to interview him. As you may know, he works for the WMF Localisation Team and his claim to fame is that he started what became translatewiki.net. Niklas has been instrumental in much of the internationalisation and localisation development for the MediaWiki software.

Thanks,
Gerard Meijssen
Internationalization / Localization outreach consultant

Congratulations, master Niklas. You finished university! What did you study and what is your exact title (in Finnish)?
I studied language technology with minors in Finnish language, Computer Science, East-Asian studies and a collection of Russian language courses. I’m now Master of Arts, filosofian maisteri.

You started with what became translatewiki.net before you started university. How did your study influence the development of translatewiki.net?
Before university I had a hobby project for inflecting Finnish nouns. It wasn’t successful, nor did it have a good design, but it started a series of events which led me to study language technology.

My studies were pretty heavily biased towards hard language processing: for instance syntactic parsers, finite state technologies and morphologies. However, the open source language technologies are not yet at a level where that kind of processing can just be plugged into any software.

Learning about variation in languages has been very useful to me. It helps in avoiding solutions that only work for a limited number of similar languages. I learned most of that in linguistics courses but also by studying several dissimilar languages. I also liked the isolated courses about copyright, terminologies and string processing, which turned out to be useful in different situations.

On the other hand, working with MediaWiki and translatewiki.net has given me enormous amounts of practical experience all over computer engineering, which helped me to perform better in engineering related courses.

(more…)

Primary data about languages

For MediaWiki, the CLDR, or Common Locale Data Repository, is a primary source of information. The information about languages that Unicode maintains in this standard is what is most relevant to us. For each language it registers the name in English and the autonym, the name in the language itself, as well as information like what a date and a number look like, the script or scripts used for the language and the names of other languages in that language.
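As a rough picture of what such a record holds, here is a minimal sketch in Python. The class and field names are hypothetical and do not mirror the actual CLDR file format or any MediaWiki code, and the Finnish values are illustrative examples rather than an extract from the repository.

```python
from dataclasses import dataclass, field


@dataclass
class LocaleData:
    """Hypothetical container for the kinds of facts described above."""
    code: str                 # language code, e.g. "fi"
    english_name: str         # name of the language in English
    autonym: str              # name of the language in the language itself
    scripts: list[str]        # script or scripts used to write it
    decimal_separator: str    # part of "what a number looks like"
    date_pattern: str         # part of "what a date looks like"
    language_names: dict[str, str] = field(default_factory=dict)

    def name_of(self, other_code: str) -> str:
        """Name of another language in this language, or its code if unknown."""
        return self.language_names.get(other_code, other_code)


# Illustrative values only; check the repository itself for the real data.
finnish = LocaleData(
    code="fi",
    english_name="Finnish",
    autonym="suomi",
    scripts=["Latn"],
    decimal_separator=",",
    date_pattern="d.M.y",
    language_names={"en": "englanti", "sv": "ruotsi"},
)

if __name__ == "__main__":
    print(finnish.autonym, finnish.name_of("en"))
```

A missing entry in `language_names` is exactly the kind of gap a language community can help to fill.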

We prefer to use standardised information, not only because it is stable and reliable, but because we do not have to collect the data ourselves and also because the data is used by many other organisations and in many other applications. We love the CLDR and we want it to be even better. To make it better we need your help.

Many of the languages that have a Wikipedia and many of the languages that want to have a Wikipedia are not represented in the CLDR. Many Wikipedians know their language really well. They can provide the information about their language and they can verify that the existing information is correct. When there is a need to change things, you will need to create a user account.

When a language is not yet supported, you will have to request that the new locale or language be added. It is expected that you provide at least the core data when you make your request and that you at least complete the minimal data required. One of the questions is where the language is official; it may be that a language does not have any official status. This does not prevent people from reading or writing that language and it does not mean that information about such a language is not important to us.

When a language is already supported, we want you to verify that the names for other languages exist and are correctly written. There can be issues in any language including English; using the Araucana name for the Mapudungun language is considered an insult.

When you are able and happy to help us in this way, you may be interested in joining our “language support team.” Because of your interest you belong to the group of people we first want to turn to when we have questions about supporting your language. More structured information and room for your reports can be found here. When there are any issues, do not hesitate to report them.

Thanks,
Gerard Meijssen
Internationalization / Localization outreach consultant

The end of the tenth sprint

Every two weeks a development sprint is finished. Every two weeks we evaluate what we achieved, what went well and what went wrong. Many of the stories of sprint 10 can be found in Mingle (user:guest, password:guest). There you see the stories that were accepted or postponed.

The stories that ended happily are all over the place.

  • The Ahirani language, a language of India that uses the Devanagari script in the same way as Marathi, is now supported for web fonts and input methods.
  • When a translation administrator encourages or discourages the translation of a text, this will now be logged. This helps translators prioritize their activities.
  • WebFonts now uses the MicroType Express font compression technology. This makes sending fonts to your browser go much faster.
  • A translator can indicate how and how often they want to be contacted. In true agile fashion, the software that will make use of this will be written in a future sprint.
  • Some texts only need to be translated into selected languages because they will reach a specific public or because they will be used in software that supports a limited number of languages. New functionality enables a translation administrator to select these languages.
  • We did a lot of code review; it gets done as it is part of our plan.

A few stories did not end on a high note:

  • Configuring one translation memory for all the wikis where the WMF needs translation took much longer than expected. The idea was to build it first on Labs. This idea has now been shelved and it will be configured directly in production.
  • A lot of work has gone into EasyTimeline. This was to make its functionality usable in other scripts and in languages that are written from right to left. It works after a fashion and many issues have been resolved. Sadly the devil is in the details. Ploticus is a dependency for EasyTimeline and it has a bug in creating SVG output. There is no plan to fix this bug in Ploticus ourselves, but we are trying to find developers who can. Until then, we cannot make progress on this feature. Please let us know if you are interested in fixing this issue for us.

Thanks,
Gerard Meijssen
Internationalization / Localization outreach consultant

Fonts and their use in source texts

When a text is written, when it is printed for a first time, it will have a contemporary look. When you then look at historic texts, at a first publication, you will notice the many details that show its age. It can be in differences in orthography, differences in vocabulary and also differences in the layout, the fonts used.

When sources are published in Wikisource, maintaining the atmosphere of the original text is very important. It is why the original orthography and vocabulary are maintained, and with the availability of the WebFonts extension there is the potential to use fonts that give this impression of age.

In the office hours of the Localisation team, the question was raised whether we could support cuneiform. The answer was that we can when there is a freely licensed font. We found a freely licensed cuneiform font and it is now available on the wikis that support WebFonts. The bigger question however is about all the other scripts that are of historic significance. This is of particular relevance to the Sanskrit Wikisource; the Sanskrit language is written in many scripts and it is only recently that the Devanagari script became the default script.

For sources like the Quran, maintaining the original orthography and characters is an article of faith. It is for this reason that characters were added to Unicode, because alternate representations of the same characters were missing. We do have a beautiful freely licensed font, the Amiri font, and we would love to support it in MediaWiki, but we are struggling with technical issues.

For the Wikimedia Localisation team, it is impossible to identify all the needs for fonts for historic text representation. This is why we have language support teams. They know their language, they can identify a need and hopefully they can identify usable freely licensed fonts. When they do, we can and will support fonts. In the meantime we will continue our work on a unified language selector. This will make the use of WebFonts easy and obvious. At this time it works, but it is hard work for you as a user.

Thanks,
Gerard Meijssen
Internationalization / Localization outreach consultant

After the slush, the flood

after the slush, the flush

When new code does not find its way into production for quite some time, it tends to pile up. It is like snow: when the thaw comes, it starts with a trickle, the trickles become streams and all the streams rush down the mountain.

In the WMF Localisation team we worked on our documentation, our help system and our tests. We went to conferences in Belgium and India. And we worked on many small iterative improvements. We rolled out WebFonts to more wikis. Input methods were improved and deployed as requested. We have had our translation memory working on translatewiki.net for ages and now it is configured for use on the WMF wikis that use the Translate extension. Actually, we did experiment first with a new algorithm and we did configure one of the Labs systems as a host for the memory of all the fine work we did and do.

Over time a lot of work went into things like plural rules. As the number of languages increases and as we support not only PHP but now also JavaScript, we are optimising our code and we are checking it again. We frequently find that a refactoring is in order. It makes the code more elegant and easier to maintain. With added documentation and tests we ensure that we know it will work well.
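To give an idea of what a plural rule does, here is a minimal sketch in Python. It is not the MediaWiki implementation in PHP or JavaScript; the function names and the message table are hypothetical, and the rules cover only non-negative whole numbers, following the integer plural categories that CLDR describes for English and Russian.

```python
def plural_category_en(n: int) -> str:
    """English: 'one' for exactly 1, 'other' for everything else."""
    return "one" if n == 1 else "other"


def plural_category_ru(n: int) -> str:
    """Russian (whole numbers only): 'one', 'few' or 'many'."""
    if n % 10 == 1 and n % 100 != 11:
        return "one"
    if 2 <= n % 10 <= 4 and not 12 <= n % 100 <= 14:
        return "few"
    return "many"


# Hypothetical lookup tables, for illustration only.
RULES = {"en": plural_category_en, "ru": plural_category_ru}
MESSAGES = {
    "en": {"one": "{n} file", "other": "{n} files"},
    "ru": {"one": "{n} файл", "few": "{n} файла", "many": "{n} файлов"},
}


def format_count(lang: str, n: int) -> str:
    """Pick the plural form for n in the given language and fill in the number."""
    category = RULES[lang](n)
    return MESSAGES[lang][category].format(n=n)


if __name__ == "__main__":
    for count in (1, 2, 5, 11, 21, 22):
        print(format_count("en", count), "|", format_count("ru", count))
```

The Russian rule shows why tests matter here: 11 and 21 both end in 1, yet they need different forms.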

Another fine project waiting to get to the stage where it will flow into our codebase is an updated Easy Timeline. The functionality has always been broken when used in many of the “other” languages: languages written in a different direction or in a different script. The updated Easy Timeline has been given a revamp; it uses SVG to create the image and you can test it at the translatewiki sandbox. Amir welcomes bug reports and LOVES to hear your comments.

As you know, we use Mingle for our project management (user guest, password guest). In it we have stories that explain the functionality that we are going to develop. Story 532 is one such:

As a potential translator, I want to be able to tell translation administrators in a structured way that I am interested in translating to one or more languages and at the same time provide them with some data about me and preferences on how and how often I would like to be contacted, so that translation administrators can more effectively and efficiently target translators.

Together with the acceptance criteria a narrative like this enables the developer to develop and the finished product to be accepted by our product manager. A story comes with tasks and once you have read the stories and the tasks you have a clue of what goes into getting you new functionality.

The conferences were great; we learned a lot from meeting so many wonderful people. Many tests are deployed and they run regularly. The documentation, including user documentation, is written and we would love you to translate it into your language. We feel really pumped up to get cracking and provide you with more functionality in the next sprint.

Thanks,
Gerard Meijssen
Internationalization / Localization outreach consultant

The #MediaWiki #hackathon in Pune, #India

When good people get together in a friendly, well organised setting like this weekend in Pune, many great things happen. Several MediaWiki developers had come to provide the many people new to MediaWiki with their expertise and guide people into its inner workings.

Many people worked on Wikimedia mobile and the SmartPhone software, others worked on MediaWiki and its extensions. Bugs got fixed and functionality got extended.

One of the surprises was two people working on the localisation for the Mongolian language. The inclusion of a web font that will support the Dzongkha language is another.

Dzongkha is the official language of Bhutan and according to Ethnologue, the script used is either Tibetan script, Uchen style or the Tibetan script, Umed style. These scripts and styles are also used for the Tibetan language, so it is not only Dzongkha that stands to benefit.

One of the highlights of the work on the SmartPhone app is support for scripts that are written from right to left; this is now “beta” functionality. The result of more people looking at the code was that several bugs received the attention needed to make them go away. Scrolling was one area that got attention; this results in a smoother user experience.

New input methods for Punjabi transliteration and for Gujarati have been created for inclusion in Narayam. The continued collaboration with RedHat engineers ensures that our work benefits both MediaWiki and RedHat/Fedora. We do realise that there is still a lot to do and it is not only documentation. Additional work was done on the “visual on-screen keyboard” that was started at the previous hackathon in Pune; it still needs more testing and design work.

Thanks,
Gerard Meijssen
Internationalization / Localization outreach consultant

Getting ready for when the freeze is done

When you look at the “sprint backlog” in Mingle (guest, guest), you may notice that even though we have been slowed down by the slush, the feature freeze because of the imminent MediaWiki release, we are not sitting on our hands. Documentation, testing, code review and outreach are on our agenda.

Because of the way we are planning, it is apparent how much code review actually gets done. This sprint we added a review of the ArticleFeedback extension for its internationalization and localization aspects. This is a logical development considering that, with 280+ languages, we are not developing for one language. Our objective for this job is: “As a user I can use the functionality of the ArticleFeedbackv5 so that nothing looks odd in my language from an internationalization and localization perspective”. Reviews like this have been performed informally in the past by translatewiki.net staff. This review, however, will be done during Wikimedia hours and reported through Wikimedia channels.

One old open bug is about EasyTimeline. It started its life in 2005 and it is finally getting the attention it deserves. The bug describes the lack of support for languages like Arabic, Hebrew and Farsi that are written from right to left. The software has Ploticus as a dependency and for a long time we were waiting for a version of this software that supports RtL languages. We are not waiting any longer and you can read in our story 230 about the complexities involved.

You could say that implementing a translation memory for page translation is a bit more adventurous. It is, however, debatable whether that functionality is new; a translation memory has been functional at translatewiki.net for a long time. It is also very much a feature that makes people more productive. Our team has always had the goal of making life easy and productive for our editors and translators.

The “grammar” functionality for JavaScript is part and parcel of the i18n tooling for our developers. It was not ready before the “slush” and it does make our lives difficult not having it available in the code. When you are building tests for “gender” and “plural”, it is so obvious to create them for “grammar” as well. In this sprint, “grammar” will be included in the code for all these good reasons.

This is the first time that there is a story for outreach. We are reaching out to all the Wikipedia language communities to have their own language support team. It will make a difference when all our language communities have been asked to provide their expertise to us. We already have found that many people show an interest and issues do get raised as a result.

Thanks,
Gerard Meijssen
Internationalization / Localization outreach consultant

 

Tutorial for using the Translate extension

On Saturday 28 January 2012 at 20:00 UTC there will be a workshop on Translation tools. It will take between 60 and 90 minutes and will consist of an introduction of use cases and features, as well as a Q&A. (local times)

The workshop will focus on the use cases covered by the Translate extension on Wikimedia Meta-Wiki for the following user roles:

  • writers: those who write texts that need to be translated
  • translation administrators: those who mark pages for translations and post-process translations when they have been made

Please put the following page on your watchlist and write your name down if you would like to attend. The workshop is held online using WebEx. I would advise you to log in 15 minutes in advance to ensure you have ample time to set up your computer if you have not used WebEx before. WebEx can be used in desktop environments on Linux, OSX and Windows.

If you would like to familiarise yourself with the technology before the workshop, please take a look at the elaborate documentation, which includes some tutorials. In the next two weeks, the already present documentation for translators will also be completed.

Credit goes to Pete Forsyth for proposing to have this workshop. Hope to see you online Saturday!

Siebrand Mazeland
Product Manager Localisation
Wikimedia Foundation