Wikimedia blog

News from the Wikimedia Foundation and about the Wikimedia movement

Posts Tagged ‘interview’

Lua previewed

The Berlin hackathon 2012 brought together a record number of people, who worked on many technical issues. Some people came to learn about MediaWiki, some came to learn about the finer points of Git and Gerrit. The great thing about MediaWiki hackathons is that there is typically a great mix of knowledgeable people, talented people and people who can explain and help with difficult technical issues. It is also where new technologies are previewed; this time it was Lua that got much of the limelight.
It is a pleasure to share what theDJ has to say in answer to questions about the hackathon and Lua.
What is the attraction of a hackathon, and what was special about Berlin 2012?

For me as a volunteer the benefit of such an event is twofold. The first part is of course getting to know the people that you usually only interact with online. It’s just more fun and the connections you build are simply stronger. It often also helps you in your future online communications with these people. When you know people in person you also tend to communicate better online.

The other reason is that it is a great way to learn, brainstorm, prototype rapidly and get questions asked and answered efficiently. Nothing beats being in the same room when discussing or working on a topic.

There were several themes in the presentations and workshops … you chose Lua. What is Lua and what is its relevance?

The complexity of pages is actually one of our biggest performance issues right now, and the [[en:Barack Obama]] page is a well-known example of that. After an edit of that page it often takes well over 20 seconds for the server to render the page again. This creates a huge resource load on the server, and it confuses editors because it seems like the server is not responding to their edits.

The complexity is caused by two things you can use in pages: templates and parser functions. The performance of these elements is shaky, in large part because our inventive MediaWiki users have found ingenious yet complex ways of working around the limited functionality these two elements provide.
Ideally much of the functionality would be converted into PHP MediaWiki extensions, but that development path is much slower and less accessible for MediaWiki users. For years there have been discussions in the developer community on how to tackle this problem, but a clearer consensus is starting to form now. The idea is to move away from the old combination of templates and parser functions and replace much of it with code written in Lua, a scripting language that is still accessible to users, much more capable than templates and parser functions, yet much easier than PHP extensions.

Overall, Lua promises much higher performance and flexibility than templates and parser functions, yet will allow us to have the same type of safeguarding on the server side that is so important for a major website like Wikipedia.

If Lua is scheduled for 2013, why all this attention now?

Exactly because it is not deployed yet. Right now we can still make significant changes easily without causing too much trouble for users. But to know what changes are needed, you do need to use the system and learn from that usage. By engaging the developer community to experiment with writing and converting templates, we can find issues that are still outstanding or that were simply never anticipated when implementing the system, before it goes into wider deployment.

Simply put, it is because the existing templates and parser functions in use right now on all these different wikis are so complicated. It will take years to replace all the code, so in order to reap the benefits as soon as possible, you will want to tackle the most complex and worst-performing code early on in the conversion.

You have been converting the “coordinates” template. What is its attraction?

The “Coord” template is a real-life example of a template with high complexity that is used on tens of thousands of pages. It is exactly the type that in theory should benefit greatly from conversion to Lua. At the same time it is still ‘small’ enough to actually get done within a reasonable amount of time. The process of converting it, instead of writing something from ‘scratch’, will likely mimic the way users will start when using the new Lua capabilities and was therefore important to test.

I have spent about 9 hours on it so far, and am probably about halfway through the conversion. After doing a full conversion I would like to benchmark the difference between the two implementations so we can further validate our suspicions about the real-world benefits of this new Lua method. A partial conversion of the template seems to have already sped it up by at least 4x in my preliminary assessments.

How will this functionality become available on the other 270+ Wikipedias?

Lua is now available on Wikimedia labs for testing and this will be followed by gradually adding mediawiki.org and other ‘low priority’ production sites. There are still major parts of the extension that require attention before it is ready for a general release.

In terms of the scripts themselves, the users will probably start with the most resource ‘expensive’ templates on English Wikipedia and slowly work their way through, at each step trying to keep everything as compatible with the old systems as needed.

Should we not implement the lessons of “Gadgets 2.0” and share them from a central site?

I think having a centralized Lua module repository, similar to the central Gadget repository for JavaScript that we will soon have, is something we should definitely consider. Past experience with scripts developed by users has taught us that it is a maintenance hell, because people fork and adapt the code for every single wiki. Though most of those copies are 95% the same code, they are not actually the same script, and if you want to change something in them, you either need to go through 270 wikis or people invest valuable time in fixing a problem that someone else has already fixed at another wiki.

For the Lua modules I think it is very important to be able to share that 95% of code that will be the same on all the wikis. This is not yet possible, but it has been discussed. It is my opinion that we really need to get that working before a full deployment in 2013.
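As a rough illustration of that sharing idea (not something shown at the hackathon), the sketch below keeps the common logic in one place and lets each wiki supply only its local differences. It is written in Python for brevity; `SHARED_DEFAULTS`, `make_latitude_formatter` and the settings keys are hypothetical names, and nothing here reflects how a central Lua repository would actually be built.

```python
# Hypothetical shared settings: the part that would be identical on every wiki.
SHARED_DEFAULTS = {"decimal_separator": ".", "north": "N", "south": "S"}


def make_latitude_formatter(local_overrides=None):
    """Build a formatter from the shared defaults plus one wiki's local overrides."""
    # Local values win over shared ones; anything not overridden is inherited.
    settings = {**SHARED_DEFAULTS, **(local_overrides or {})}

    def format_latitude(degrees):
        hemisphere = settings["north"] if degrees >= 0 else settings["south"]
        number = f"{abs(degrees):.2f}".replace(".", settings["decimal_separator"])
        return f"{number}° {hemisphere}"

    return format_latitude
```

A wiki that writes decimals with a comma would call `make_latitude_formatter({"decimal_separator": ","})` and inherit everything else, so a fix to the shared part would reach every wiki at once instead of being repeated per copy.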

Several people were hacking Lua code and even more attended the workshop. What is the most relevant thing for them to do moving forward?

Provide feedback based on their experiences. As I see it, this is a learning stage, and as a group we can only take all lessons into account if we share what each and every one of us has learned.

You identified two parts to converting templates to Lua, the conversion itself and optimisation. How relevant will optimisation be?

As I said earlier, the users have found ingenious but complex ways around the limitations of templates and parser functions. A conversion is about changing from one language to the other, without changing HOW the code works. This conversion will probably already provide large speed gains.
Optimizing is about getting rid of all the weird constructs that we used because we worked around the limitations of templates and parser functions. These constructs are no longer required and will actually slow down the Lua script, so you will want to remove them.
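
To make the distinction concrete, here is a minimal sketch (in Python for illustration, not actual template or Lua code). It assumes a template language without loops, so that the original author had to spell out every case by hand; `repeat_converted` and `repeat_optimized` are hypothetical names invented for this example.

```python
def repeat_converted(text, count):
    """Literal conversion: keeps the hand-unrolled cases of a loop-less template."""
    if count == 1:
        return text
    if count == 2:
        return text + text
    if count == 3:
        return text + text + text
    return ""


def repeat_optimized(text, count):
    """Optimised version: same results, with the workaround replaced by one expression."""
    return text * count if 1 <= count <= 3 else ""
```

Both functions return the same output for every input; the second simply no longer carries the construct that only existed to work around a missing feature.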

You use Lua in your day job. In what way is Lua for MediaWiki different from the Lua that you know?

Not so much actually. Of course there is the interface towards MediaWiki which is different from the interface that I work with (an interface to write mobile applications) but the language is exactly the same.

It could have been the first question: what benefit will Lua bring us?

It will speed up pages and also make it possible to do even more advanced templating. At the same time it will look a bit less scary to editors, and will create more readable code that is easier to maintain.

Niklas Laxström, language engineer and Wikimedian

(Image: University of Helsinki)

The average MediaWiki developer is quite young. They often started contributing to the MediaWiki code while still in school or university. When their contributions show promise, they are sometimes asked to contribute to particular projects. This has resulted in the hiring of students, who continue to do professionally what they at first did as a hobby.

While the Wikimedia Foundation is happy with the talent it gains in this way, it feels strongly that finishing formal education is very important. Some students only work for the WMF in their holidays while others manage regular contributions in their free time as well. Such relations are often strengthened through programs like the Google Summer of Code or through summer internships.

Niklas Laxström recently finished university, and this happy occasion is reason enough to interview him. As you may know, he works for the WMF Localisation Team and his claim to fame is that he started what became translatewiki.net. Niklas has been instrumental in much of the internationalisation and localisation development for the MediaWiki software.

Thanks,
Gerard Meijssen
Internationalization / Localization outreach consultant

Congratulations, master Niklas. You finished university! What did you study, and what is your exact title (in Finnish)?
I studied language technology with minors in Finnish language, Computer Science, East-Asian studies and a collection of Russian language courses. I am now a Master of Arts, filosofian maisteri.

You started what became translatewiki.net before you started university. How did your studies influence the development of translatewiki.net?
Before university I had a hobby project for inflecting Finnish nouns. It was not successful, nor did it have a good design, but it started a series of events that led me to study language technology.

My studies were pretty heavily biased towards hard language processing: for instance syntactic parsers, finite state technologies and morphologies. However, the open source language technologies are not yet at a level where that kind of processing can just be plugged into any software.

Learning about variation in languages has been very useful to me. It helps to avoid solutions that only work for a limited number of similar languages. I learned most of that in linguistics courses but also by studying several dissimilar languages. I also liked the isolated courses about copyright, terminologies and string processing, which turned out to be useful in different situations.

On the other hand, working with MediaWiki and translatewiki.net has given me enormous amounts of practical experience all over computer engineering, which helped me to perform better in engineering related courses.

(more…)

Interview with Wikimedia’s Amir Aharoni

If there is one thing that makes the Localisation team special, it is that all the team members were collaborating before they were hired by the Wikimedia Foundation. Continuing in this spirit of “never change a winning team”, we are happy with the addition of Amir Aharoni as a developer to the team. We know him well and have worked well together in the past. As they say in insurance, past results do not predict future results, but this is about people, and Amir is a great guy.

Thanks,
Gerard Meijssen
Internationalization / Localization outreach consultant

You are a specialist on RtL languages. How many people start writing on the right and does this not smear the ink when you move while writing to the left?

These are actually two questions :)

I’ll start with the second: no, it doesn’t smear the ink. I don’t remember that it ever bothered me, and to make sure i just tested it on a piece of paper. Several times i have read the claim that this was the reason the Greeks switched to writing left-to-right when they adapted the Phoenician alphabet (the older version of Hebrew) to their language, but unless I am missing something, that just doesn’t seem to be a good reason.

How many people start writing on the right? It’s very hard to answer that question precisely. The biggest languages that are written right-to-left are Arabic, Urdu, Pashto, Persian, Sindhi, Kashmiri, Azeri – all written with different varieties of the Arabic alphabet; Hebrew and Yiddish, written with the Hebrew alphabet; and Mandinka and Dyula, written with the N’Ko alphabet. (Other important right-to-left languages are Syriac and Divehi, but they have relatively few speakers.)

If you sum up the number of speakers of all these languages, you’ll arrive at about 400 million people. That’s a large number, but it’s also a very, very rough estimate. First, i didn’t count all the relevant languages; second, many people who speak these languages live in countries with low literacy rates; and finally, many people speak and write some of them as their second language.

Are there benefits to writing from right to left?

To tell the truth – no, not really. But there is a benefit in the fact that it exists. I like the general idea of diversity. Direction of writing is just one of those things that show that very few things in life can be taken for granted, like electricity plug shape, time zone, sexual orientation, taste in food, appearance of things to colour-blind people and, well, almost anything else. And that’s a mighty good thing.

In both Arabic and Hebrew it is possible to indicate what vowels are used. What is the rationale for not including them as standard?

The simplest answer is “People’s customs”.

When people started writing Arabic and Hebrew, they didn’t write as much as we do today and the variety of words was not so great, so they could easily guess the needed vowels just by looking at the consonants. Our writing today is much more varied and it makes guessing the vowels harder, but the custom to omit them is still there.

I don’t know about Arabic, but there were suggestions to always write Hebrew with the vowels. The most notable suggestion to do this was made in the 1930s by Hayim Nahman Bialik, the most prominent Hebrew poet of the twentieth century and the president of the Hebrew Language Committee. Despite Bialik’s status, this proposal was never implemented, among other things because of the technical challenges in printing books with so many diacritics. And, as much as i love them, i must also admit that writing them all slows down handwriting quite considerably.

There is also the problem of the many differences between the vowel marks and the actual spoken language. Modern Hebrew has five vowel sounds, but over ten vowel marks. In Arabic it goes the other way: it has only three vowel marks, but there are more than three vowel sounds in the spoken language, and it also changes from region to region. So very often a lot of people don’t really know which vowel marks they should write even when they want to write them. This also creates an opportunity for patronizing: knowing the right grammar makes one feel smarter than others and unfortunately some people exploit it in ways that are not very constructive.

There are, however, scripts whose structure is quite similar to that of Arabic and Hebrew and which do indicate all the vowels in writing. The most prominent example is Divehi, whose script is a derivative of Arabic. The scripts of India and Southeast Asia are somewhat comparable as well. My guess (and hope) is that in the near future modern technology will make writing Arabic and Hebrew with vowels easier, if not universal.

You are a member of the Wikimedia Israel board. Do you think working for the WMF creates a conflict of interest?

I find it hard to think about anything that can create a conflict of interest here. I discussed it with the other Board members and they couldn’t think of anything either.

If anybody does think that it is a problem, i’ll be very glad to hear about it.

Do all mobile phones sold in Israel support the Hebrew script and is combining it with the Latin script possible?

Most phones sold in Israel do support it; they also support right-to-left display and even the support for mixing Hebrew and English in one SMS message is reasonable. I don’t know whether the regulations require it or whether it’s just a matter of demand.

Some people buy themselves fancy smartphones abroad and these don’t always support the Hebrew script.

How do you cope when the Ivrit script is not supported?

Writing Hebrew in Latin transliteration is not very common and most Israelis know at least some English, so a person who cannot write in Hebrew for any reason would probably just use English. That includes myself; using transliteration would be better in principle, but a lot of people would find it harder to read.

I should also add that when i bought my first mobile phone in the year 2000, Hebrew support in cellphones was still new and uncommon. I could pick between two models – one with Hebrew and one without. I picked the one with Hebrew, even though it cost about $100 more and it was long before i cared about software localization as much as i do today. I did it simply because it made much more sense to write names of friends in the contact list in the Hebrew script.

You have a competency in many languages because of your study as a linguist. Can you indicate how different the five language families are?

The differences i notice are mostly in the grammar features, some of which are very prominent in some languages and hardly exist in others. For example, in Russian, when you say that you read a book, it’s essential to say whether you finished it or only read a part of it. In Hebrew and Arabic the root of a word is an abstract unpronounceable sequence of consonants and the actual words are created by inserting vowel sounds between them – this concept sounds quite crazy to speakers of European languages.
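
A toy illustration of that root-and-pattern idea, written in Python: the digits 1 to 3 in a pattern stand for the three consonants of a root, and everything else in the pattern is copied as-is. The digit notation and the function name `apply_pattern` are made up for this sketch, the transliteration is simplified, and real Semitic morphology involves far more than string substitution.

```python
def apply_pattern(root, pattern):
    """Fill the digits 1-3 in `pattern` with the consonants of a three-consonant root."""
    consonants = root.split("-")
    if len(consonants) != 3:
        raise ValueError("expected a root of three consonants, such as 'k-t-b'")
    # Digits select a consonant from the root; any other character is kept unchanged.
    return "".join(
        consonants[int(ch) - 1] if ch in "123" else ch
        for ch in pattern
    )


# Arabic root k-t-b (associated with writing), in simplified transliteration.
for pattern in ("1a2a3a", "1i2ā3", "1ā2i3", "ma12a3"):
    print(apply_pattern("k-t-b", pattern))
```

With the root k-t-b this yields kataba (“he wrote”), kitāb (“book”), kātib (“writer”) and maktab (“office” or “desk”): the consonants stay fixed while the vowels and affixes around them carry the change in meaning.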

In Hebrew and Arabic there’s a strong formal distinction between verbs that describe things that a person does to oneself and things done to others. In Romance languages, like Italian and Catalan, the subjunctive mood is very prominent – it’s essential to indicate whether a person did something or would do it; this distinction is less essential in English and it hardly exists in Hebrew. That, i’d say, is not just a matter of grammar, but also a way to think about things, but that’s a philosophical issue that is very hard to test. Finally, Malayalam has a word order that is logical in itself, but very unusual to somebody who speaks a European language.

You can write in at least five scripts. What script do you consider the most usable?

I suppose that the five scripts you refer to are Cyrillic, Latin, Hebrew, Arabic and Malayalam. (I can read Ethiopic, too. And, well, Greek, but that’s really not a big deal.)

The most usable is Cyrillic, of course, closely followed by Latin. It’s hard to be objective in such a case, because Russian is my native language, but i really think that it has the best balance between simplicity, size (slightly over 30 letters), completeness and being fit for the languages it is supposed to represent. (I’ll try to balance my natural bias towards Russian by saying that the Russian orthography is actually relatively outdated and relatively harder than the orthography of other languages using Cyrillic, like Belarusian or Kyrgyz.)

Latin is a close second, because in general it is very similar to Cyrillic, but its actual pronunciation and usage, as well as the names of letters, differ wildly between languages.

I love Hebrew, but i’ll be the first to acknowledge its disadvantages. Malayalam, though beautiful to behold, is rather hard to grasp, but once you get the hang of it, it does convey all the needed sounds well.

How do you want to put your stamp on the Localisation team?
Apart from my expertise in Middle Eastern scripts, i hope to influence it in the areas of usability and testing. I don’t claim to be a usability expert, but i care very strongly about it and i want to know that the users of the software i create are actually able to use it. I also believe that all features of software localization must be thoroughly tested; it’s costly and challenging, but important, and that’s why i hope to find the time to formulate a localization testing policy.

One other and somewhat more personal thing that i hope to achieve through my work in this team is spreading the word about the Software Localization Paradox. 

Your wife is also a dancer. Did you ever come across DanceWriting?

I asked her and she says that it’s a cool idea, but too complicated to learn, and that in the age when phones come with reasonable video cameras it’s easier to just film the moves.

Dancing is her (very important) hobby and her main fields of work are Neuroscience and Physics. That involves a lot of math formulas and, from my perspective, what she says about DanceWriting could well be said about math.

–  Amir