Wikimedia blog

News from the Wikimedia Foundation and about the Wikimedia movement

Posts Tagged ‘i18n’

First Look at the Content Translation tool

The projects in the Wikimedia universe can be accessed and used in a large number of languages from around the world. The Wikimedia websites, their MediaWiki software (both core and extensions) and their growing content benefit from standards-driven internationalization and localization engineering that makes the sites easy to use in every language, across diverse platforms, both desktop and mobile.

However, a wide disparity exists in the numbers of articles across language wikis. The article count across Wikipedias in different languages is an often cited example. As the Wikimedia Foundation focuses on the larger mission of enabling editor engagement around the globe, the Wikimedia Language Engineering team has been working on a content translation tool that can greatly facilitate the process of article creation by new editors.

About the Tool


The Content Translation editor displaying a translation of the article for Aeroplane from Spanish to Catalan.

Particularly aimed at users fluent in two or more languages, the Content Translation tool has been in development since the beginning of 2014. It will provide a combination of editing and translation tools that can be used by multilingual users to bootstrap articles in a new language by translating an existing article from another language. The Content Translation tool has been designed to address basic templates, references and links found in Wikipedia articles.

Development of this tool has involved significant research and evaluation by the engineering team to handle elements like sentence segmentation, machine translation, rich-text editing, user interface design and scalable backend architecture. The first milestone for the tool’s rollout this month includes a comprehensive editor, limited capabilities in areas of machine translation, link and reference adaptation and dictionary support.

Why Spanish and Catalan as the first language pair?

Presently deployed at http://es.wikipedia.beta.wmflabs.org/wiki/Especial:ContentTranslation, the tool is open for wider testing and user feedback. Users will have to create an account on this wiki and log in to use the tool. For the current release, machine translation can only be used to translate articles between Spanish and Catalan. This language pair was chosen for the two languages' close linguistic similarity, as well as the availability of well-supported language aids like dictionaries and machine translation. Driven by a passionate community of contributors, the Catalan Wikipedia is an ideal medium-sized project for testing and feedback. We also hope to enhance the aided translation capabilities of the tool by generating parallel corpora of text from within the tool.

To view Content Translation in action, please follow the link to this instance and make the following selections:

  • article name – the article you would like to translate
  • source language – the language in which the article you wish to translate exists (restricted to Spanish at this moment)
  • target language – the language in which you would like to translate the article (restricted to Catalan at this moment)

This will lead you to the editing interface where you can provide a title for the page, translate the different sections of the article and then publish the page in your user namespace in the same wiki. This newly created page will have to be copied over to the Wikipedia in the target language that you had earlier selected.

Users in languages other than Spanish and Catalan can also view the functionality of the tool by making a few tweaks.

We care about your feedback

Please provide your feedback on this page on the Catalan Wikipedia or at this topic on the project’s talk page. We will attempt to respond as soon as possible, based on the criticality of the issues surfaced.

Runa Bhattacharjee, Outreach and QA coordinator, Language Engineering, Wikimedia Foundation

MediaWiki localization file format changed from PHP to JSON

Translations of MediaWiki’s user interface are now stored in a new file format—JSON. This change won’t have a direct effect on readers and editors of Wikimedia projects, but it makes MediaWiki more robust and open to change and reuse.

MediaWiki is one of the most internationalized open source projects. MediaWiki localization includes translating over 3,000 messages (interface strings) for MediaWiki core and an additional 20,000 messages for MediaWiki extensions and related mobile applications.

User interface messages, originally in English, and their translations have historically been stored in PHP files along with MediaWiki code. New messages and their documentation were added in English, and these messages were translated on translatewiki.net into over 300 languages. These translations were then pulled by MediaWiki websites using LocalisationUpdate, an extension that MediaWiki sites use to receive translation updates.

So why change the file format?

The change of file format was motivated by the need to improve security, reduce localization file sizes and support interoperability.

Security: PHP files are executable code, so the risk of malicious code being injected is significant. In contrast, JSON files are only data, which minimizes this risk.

Reducing file size: Some of the larger extensions have had multi-megabyte localization data files. Editing those files was becoming a management nightmare for developers, so they were split into one file per language instead of storing all languages in a single large file.

Interoperability: The new format increases interoperability by allowing features like VisualEditor and the Universal Language Selector to be decoupled from MediaWiki, because JSON localization files can be used without MediaWiki. This was demonstrated earlier with the jquery.i18n library. This library, developed by Wikimedia’s Language Engineering team in 2012, has internationalization features very similar to what MediaWiki offers, but it is written fully in JavaScript and stores messages and message translations in JSON format. With LocalisationUpdate’s modernization, MediaWiki localization files are now compatible with those used by jquery.i18n.
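To make the compatibility concrete, here is a minimal sketch of how the same kind of key-to-message data can be loaded with jquery.i18n. The message keys, translations and exact API details below are illustrative assumptions rather than code taken from MediaWiki or the library's documentation.

```javascript
// A sketch of the shared message format: keys mapping to translatable
// strings with $1-style parameters and MediaWiki-style PLURAL syntax.
// The keys and translations are invented for this example.
var messages = {
	en: {
		'example-file-size': 'The file size is $1 {{PLURAL:$1|byte|bytes}}.'
	},
	es: {
		'example-file-size': 'El tamaño del archivo es de $1 {{PLURAL:$1|byte|bytes}}.'
	}
};

// Load the messages and render one of them in Spanish.
$.i18n().load( messages ).done( function () {
	$.i18n().locale = 'es';
	// Parameters replace $1, $2, ... and PLURAL picks the right form.
	console.log( $.i18n( 'example-file-size', 1 ) );
	console.log( $.i18n( 'example-file-size', 1024 ) );
} );
```

The same key/value structure, with an added "@metadata" block for authorship, is what now lives in MediaWiki's per-language JSON files.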

An RFC on this topic was compiled and accepted by the developer community. In late 2013, developers from the Language Engineering and VisualEditor teams at Wikimedia collaborated to figure out how MediaWiki could best process messages from JSON files. They wrote a script for converting PHP to JSON, made sure that MediaWiki’s localization cache worked with JSON, and updated the LocalisationUpdate extension for JSON support.

Siebrand Mazeland converted all the extensions to the new format. This project was completed in early April 2014, when MediaWiki core switched over to processing JSON, creating the largest MediaWiki patch ever in terms of lines of code. The localization formats are documented on mediawiki.org, and MediaWiki’s general localization guidelines have been updated as well.

As a side effect, code analyzers like Ohloh no longer report skewed numbers for lines of PHP code, making metrics like comment ratio comparable with other projects.

Work is in progress on migrating other localized strings, such as namespace names and MediaWiki magic words. These will be addressed in a future RFC.

This migration exemplifies collaboration at its best among the many MediaWiki engineers who contributed to it. I would like to specially mention Adam Wight, Antoine Musso, David Chan, Ed Sanders, Federico Leva, James Forrester, Jon Robson, Kartik Mistry, Niklas Laxström, Raimond Spekking, Roan Kattouw, Rob Moen, Sam Reed, Santhosh Thottingal, Siebrand Mazeland and Timo Tijhof.

Amir Aharoni, Interim PO and Software Engineer, Wikimedia Language Engineering Team

Modernising MediaWiki’s Localisation Update

Interface messages on MediaWiki and its many extensions are translated into more than 350 languages on translatewiki.net. Thousands of translations are created or updated each day. Usually, users of a wiki would have to wait until a new version of MediaWiki or of an extension is released to see these updated translations. However, webmasters can use the LocalisationUpdate extension to fetch and apply these translations daily without having to update the source code.

LocalisationUpdate provides a command line script to fetch updated translations. It can be run manually, but usually it is configured to run automatically using cron jobs. The sequence of events that the script follows is:

  1. Gather a list of all localisation files that are in use on the wiki.
  2. Fetch the latest localisation files from either:
    • an online source code repository, using https, or
    • clones of the repositories in the local file system.
  3. Check whether English strings have changed to skip incompatible updates.
  4. Compare all translations in all languages to find updated and new translations.
  5. Store the translations in separate localisation files.

MediaWiki’s localisation cache will automatically find the new translations via a hook subscribed by the LocalisationUpdate extension.
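As a rough illustration of steps 3 and 4 above, the comparison amounts to keeping a fetched translation only when its English source string is unchanged and the translation itself is new or different locally. The sketch below is a hypothetical simplification, not the extension's actual code.

```javascript
// Hypothetical sketch of the update check (steps 3 and 4 above):
// keep a fetched translation only if the English source string has not
// changed and the translation is new or different from the local copy.
function findUpdatedTranslations( localEn, remoteEn, localXx, remoteXx ) {
	var updates = {};
	Object.keys( remoteXx ).forEach( function ( key ) {
		var englishUnchanged = localEn[ key ] === remoteEn[ key ];
		var translationChanged = localXx[ key ] !== remoteXx[ key ];
		if ( englishUnchanged && translationChanged ) {
			updates[ key ] = remoteXx[ key ];
		}
	} );
	return updates; // written out as a separate localisation file (step 5)
}
```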

Until very recently the localisation files existed in PHP format. These have now been converted to JSON format, an update that required changes in LocalisationUpdate to handle JSON files. Extending the code piecemeal over the years had made the code base tough to maintain. The code has been rewritten with extensibility in mind, both to support future development and to retain adequate support for older MediaWiki versions that use this extension.

The rewrite did not add any new features except support for JSON format. The code for the existing functionality was refactored using modern development patterns such as separation of concerns and dependency injection. Unit tests were added as well.

The configuration format for the update scripts changed, but most webmasters won’t need to change anything, and will be able to use the default settings. Changes will be needed only on sites that for some reason don’t use the default repositories.

New features are being planned for future versions that would optimise LocalisationUpdate to run faster and without any manual configuration. Currently, the client downloads the latest translations for all extensions in all languages and then compares which translations can be updated. By moving some of the complex processing to a separate web service, the client could save bandwidth by downloading only the updated messages for the specific languages in use.
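For example, instead of fetching everything, a future client might ask a web service only for messages that changed since its last run, limited to the languages the wiki uses. The endpoint and parameters in the sketch below are invented for illustration; no such service exists yet.

```javascript
// Hypothetical sketch of the planned optimisation: ask a web service
// only for translations that changed since the last update, for the
// languages this wiki actually uses. The endpoint and parameters are
// invented for illustration.
function fetchUpdates( lastRunTimestamp, languages ) {
	return fetch( 'https://example.org/localisation-updates' +
		'?since=' + encodeURIComponent( lastRunTimestamp ) +
		'&languages=' + encodeURIComponent( languages.join( '|' ) )
	).then( function ( response ) {
		// Expected shape: { language: { messageKey: translation } }
		return response.json();
	} );
}
```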

There are still more things to improve in LocalisationUpdate. If you are a developer or a webmaster of a MediaWiki site, please join us in shaping the future of this tool.

Niklas Laxström and Runa Bhattacharjee, Language Engineering, Wikimedia Foundation

Webfonts: Making Wikimedia projects readable for everyone

Wikimedia wikis are available in nearly 300 languages, and some of them have pages with mixed-script content. An example is the page on the writing systems of India on the English Wikipedia. We expect users to be able to view this page in full and not see meaningless squares, also known as tofu. These tofu squares stand in for letters that are written in the language but cannot be rendered by the web browser on the reader’s device. This may happen for several reasons:

  • The device does not have a font for the particular script;
  • The operating system or the web browser does not support the technology needed to render the characters;
  • The operating system or the browser supports the script only partially. For instance, because characters for several scripts have been added gradually in recent Unicode versions, existing older fonts may not support the new characters.

Fonts for most languages written in the Latin script are widely available on a variety of devices. However, languages written in other scripts often face obstacles when fonts on operating systems are unavailable, outdated, bug-ridden or aesthetically sub-optimal for reading content.

Using Webfonts with MediaWiki

To alleviate these shortcomings, the WebFonts extension was first developed and deployed to some wikis in December 2011. The underlying technology provides the ability to download fonts automatically to the user if they are not present on the reader’s device, similar to how images in web pages are downloaded.

The old WebFonts extension was converted to the jquery.webfonts library, which was included in the Universal Language Selector, the extension that replaced the old WebFonts extension. Webfonts are applied using the jquery.webfonts library, and on Wikimedia wikis it is configured to use the fonts in the MediaWiki repository. The two important questions we need answered before this can be done are:

  1. Will the user need webfonts?
  2. If yes, which one(s)?

Webfonts are provided when:

  • Users have chosen to use webfonts in their user preferences.
  • The font is explicitly selected in CSS.
  • Users viewing content in a particular language do not have the fonts on their local devices, or the devices do not display the characters correctly, and the language has an associated default font that can be used instead. Before the webfonts are downloaded, a test currently known as “tofu detection” is done to ascertain that the local fonts are indeed not usable. The default fonts are chosen by the user community.

Webfonts are not applied:

  • when users choose not to use webfonts, even if there exists a valid reason to use webfonts (see above);
  • in the text edit area of the page, where the user’s preference or browser settings are honored.

See image (below) for a graphical description of the process.

‘Tofu’ Detection

The font to be applied is chosen either by the name of the font-family or, if the designated font family is not available, according to the language. In the latter case, the language’s default font takes precedence. However, negotiating more complex selection options like font inheritance and fallback adds to the challenge. For projects like Wikimedia, selecting appropriate fonts for inclusion is also a concern. The many challenges include the absence of well-maintained fonts, the limited number of freely licensed fonts and the rejection of fonts by users for being sub-optimal.
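One common way to implement such a tofu check, sketched below, is to measure how a sample of text renders with the candidate font compared to a deliberately nonexistent fallback: if both render at exactly the same width, the glyphs are most likely missing and a webfont should be served. This is an illustrative sketch of the general technique, not the actual check used by ULS.

```javascript
// Illustrative sketch of a width-based "tofu" check: render a sample
// of the script with the candidate font and with a nonsense fallback
// font name; if both render identically, the glyphs are probably
// missing and a webfont should be served.
function looksLikeTofu( sampleText, fontFamily ) {
	function measure( family ) {
		var span = document.createElement( 'span' );
		span.style.fontFamily = family;
		span.style.fontSize = '72px';
		span.style.position = 'absolute';
		span.style.visibility = 'hidden';
		span.textContent = sampleText;
		document.body.appendChild( span );
		var width = span.offsetWidth;
		document.body.removeChild( span );
		return width;
	}
	// 'no-such-font' forces the browser's last-resort fallback rendering.
	return measure( "'" + fontFamily + "', no-such-font" ) === measure( 'no-such-font' );
}
```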

Challenges to Webfonts

Merely serving the webfont is not the only challenge that this technology faces. The complexities are compounded for languages of South and South-East Asia as well as Ethiopia, and for a few other scripts with nascent internationalization support. Font rendering and support for these scripts vary across operating system platforms. The inconsistency can stem from the technology used, such as the rendering engines, which can produce widely different results across browsers and operating systems. Santhosh Thottingal, senior engineer on Wikimedia’s Language Engineering team, who has been participating in recent developments to make webfonts more efficient, outlines this in greater detail.

Checkbox in the Universal Language Selector preferences to download webfonts

A major impact is on bandwidth consumption and page load time, due to the additional overhead of delivering webfonts to millions of users. A recent consequence of this challenge was a change introduced in the Universal Language Selector (ULS) to prevent pages from loading slowly, particularly where bandwidth is at a premium. A checkbox now allows users to decide whether they would like webfonts to be downloaded.

Implementing Webfonts

Several clever solutions are currently in use to work around the known challenges. The webfonts are prepared with the aim of keeping their footprint comparatively small. For instance, Google’s sfntly tool, which uses MicroType Express compression, is used to create the fonts in EOT format (WOFF being the other widely used webfont format). However, the inherent size demands of scripts with larger character sets cannot always be avoided. Caches are used to reduce unnecessary webfont downloads.

FOUT, or Flash Of Unstyled Text, is an unavoidable consequence of the browser displaying text in different styling, or no text at all, while waiting for the webfonts to load. Different web browsers handle this differently, and optimizations are in the making. A possible solution in the near future may be the in-development WOFF2 webfont format, which is expected to further reduce font sizes and improve performance and font load events.

Special fonts like the Autonym font are used in places where known text, like a list of language names, is required to be displayed in multiple scripts. The font carries only the characters that are necessary to display the predefined content.

Additional optimizations at this point are directed towards improving the performance of the JavaScript libraries that are used.

Conclusion

Several technical solutions are being explored within Wikimedia Language Engineering and in collaboration with organizations with similar interests. Wikipedia’s sister project Wikisource attempts to digitize and preserve copyright-expired literature, some of which is written in ancient scripts. In these as well as other cases like accessibility support, webfonts technology allows fonts for special needs to be made available for wider use. The clear goal is to have readable text available for all users irrespective of the language, script, device, platform, bandwidth, content and special needs.

For more information on implementing webfonts in MediaWiki, we encourage you to read and contribute to the technical document on mediawiki.org.

Runa Bhattacharjee, Outreach and QA coordinator, Language Engineering, Wikimedia Foundation

Language Engineering Events – Language Summit, Fall 2013

The Wikimedia Language Engineering team, along with Red Hat, organised the Fall edition of the Open Source Language Summit in Pune, India on November 18 and 19, 2013.

Members from the Language Engineering, Mobile, VisualEditor, and Design teams of the Wikimedia Foundation joined participants from Red Hat, Google, Adobe, Microsoft Research, Indic language projects, open source projects (Fedora, Debian) and Wikipedians from various Indian languages. Google Summer of Code interns for Wikimedia language projects were also present. The two-day event was organised as work sessions focussed on fonts, input tools, content translation and language support on desktop, web and mobile platforms.

Participants at the Open Source Language Summit, Pune India

The Fontbook project, started during the Language Summit earlier this year, was marked for extension to eight more Indian languages. The project aims to create a technical specification for Indic fonts based upon the OpenType v1.6 specification. Pravin Satpute and Sneha Kore of Red Hat presented their work on the next version of the Lohit font family, based upon the same specification and using Harfbuzz-ng. This effort is expected to complement the Fontbook project.

The other font sessions included a walkthrough of the Autonym font created by Santhosh Thottingal, a Q&A session by Behdad Esfahbod about the state of Indic font rendering through Harfbuzz-ng, and a session to package webfonts for Debian and Fedora for native support. Learn more about the font sessions.

Improving the input tools for multilingual input in the VisualEditor was extensively discussed. David Chan walked through the event logger system built for capturing IME input events, which is being used as an automated IME testing framework, available at http://tinyurl.com/imelog, to build a library of such events across IMEs, operating systems and languages.

Santhosh Thottingal stepped through several tough use cases of handling multilingual input, to support the VisualEditor’s inherent need to provide non-native support for handling language content blocks within the contentEditable surface. Wikipedians from various Indic languages also provided their inputs. On-screen keyboards, mobile input methods like LiteratIM and predictive typing methods like ibus-typing-booster (available for Fedora) were also discussed. Read more about the input method sessions.

The Language Coverage Matrix Dashboard, which displays the language support status of all languages in Wikimedia projects, was showcased. The Fedora Internationalization team, which currently provides resources for fewer languages than the Wikimedia projects, will identify the gap using the LCMD data and assess the resources that can be leveraged to enhance language support on desktops. Dr. Kalika Bali from Microsoft Research Labs presented on leveraging content translation platforms for Indian languages, highlighting that machine translation for Indic languages could be improved significantly by using web-scale content like Wikipedia.

Learn more about the sessions, accomplishments and next steps for these projects from the Event Report.

Runa Bhattacharjee, Outreach and QA coordinator, Language Engineering, Wikimedia Foundation

The Autonym Font for Language Names

When an article on Wikipedia is available in multiple languages, we see the list of those languages in a column on the side of the page. Each language name in the list is written in the script that the language uses; such a name is also known as the language autonym.

This also means that all the appropriate fonts are needed for the autonyms to be correctly displayed. For instance, an article like the one about the Nobel Prize is available in more than 125 languages and requires approximately 35 different fonts to display the names of all the languages in the sidebar.

Language Autonyms

Initially, this was handled by the native fonts available on the reader’s device. If a font was not present, the user would see square boxes (commonly referred to as tofu) instead of the name of a language. To work around this problem, not just for the language list, but for other sections in the content area as well, the Universal Language Selector (ULS) started to provide a set of webfonts that were loaded with the page.

While this ensured that more language names would be correctly displayed, the presence of so many fonts dramatically increased the weight of the pages, which therefore loaded much more slowly for users than before. To improve client-side performance, webfonts were set not to be used for the Interlanguage links in the sidebar anymore.

Removing webfonts from the Interlanguage links was the easy and immediate solution, but it also took us back to the sub-optimal multilingual experience that we were trying to fix in the first place. Articles may be perfectly displayed thanks to webfonts, but if a link is not displayed in the language list, many users will not be able to discover that there is a version of the article in their language.

Autonyms were not needed just for Interlanguage links. They were also required for the Language Search and Selection window of the Universal Language Selector, which allows users to find their language if they are on a wiki displaying content in a script unfamiliar to them.

Missing font or “tofu”

As a solution, the Language Engineers came up with a trimmed-down font that only contains the characters required to display the names of the languages supported in MediaWiki. It has been named the Autonym font and will be used when only the autonyms are to be displayed on the page. At just over 50KB in size, it currently provides support for nearly 95% of the 400+ supported languages. The pending issues list identifies the problems with rendering and missing glyphs for some languages. If your language is missing glyphs and you know of an openly licensed font that can fill that void, please let us know so we can add it.

The autonym font addresses a very specific use case. There have been requests to explore the possibility of extending the use of this font to similar language lists, like the ones found on Wikimedia Commons. Within MediaWiki, the font can be used easily through a CSS class named autonym.
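For instance, a gadget or site script could opt a list of language names into the font simply by adding that class; the selector in the sketch below is invented for illustration.

```javascript
// Hypothetical example: opt a list of language names into the Autonym
// font by adding the 'autonym' CSS class mentioned above. The
// '.my-language-list a' selector is invented for illustration.
$( '.my-language-list a' ).addClass( 'autonym' );
```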

The Autonym font has been released for free use with the SIL Open Font License, Version 1.1.

Runa Bhattacharjee, Outreach and QA coordinator, Language Engineering, Wikimedia Foundation

Get introduced to Internationalization engineering through the MediaWiki Language Extension Bundle

The MediaWiki Language Extension Bundle (MLEB) is a collection of MediaWiki extensions for various internationalization features. These extensions and the Bundle are maintained by the Wikimedia Language Engineering team. Each month, a new version of the Bundle is released.

The MLEB gives webmasters who run sites with MediaWiki a convenient solution to install, manage and upgrade language tools. The monthly release cycle allows for adequate testing and compatibility across the recent stable versions of MediaWiki.

A plate depicting text in Sanskrit (Devanagari script) and Pali languages, from the Illustrirte Geschichte der Schrift by Johann Christoph Carl Faulmann

The extensions that form MLEB can be used to create a multilingual wiki:

  • UniversalLanguageSelector — allows users to configure their language preferences easily;
  • Translate — allows a MediaWiki page to be translated;
  • CLDR — a data repository for language-specific locale data like date, time, currency etc. (used by the other extensions);
  • Babel — provides information about language proficiency on user pages;
  • LocalisationUpdate — updates MediaWiki’s multilingual user interface;
  • CleanChanges — shows RecentChanges in a way that reflects translations more clearly.

The Bundle can be downloaded as a tarball or from the Wikimedia Gerrit repository. Release announcements are generally made on the last Wednesday of the month, and details of the changes can be found in the Release Notes.

Before every release, the extensions are tested against the last two stable versions of MediaWiki on several browsers. Some extensions, such as UniversalLanguageSelector and Translate, need extensive testing due to their wide range of features. The tests are prepared as Given-When-Then scenarios, i.e. an action is checked for an expected outcome assuming certain conditions are met. Some of these tests are in the process of being automated using Selenium WebDriver and the remaining tests are run manually.

The automated tests currently run only on Mozilla Firefox. For the manual test runs, the Given-When-Then scenarios are replicated across several web browsers. These are mostly the Grade-A level supported browsers. Regressions or bugs are reported through Bugzilla. If time permits, they are also fixed before the monthly release, or otherwise scheduled to be fixed in the next one.

The MLEB release process offers several opportunities to participate in the development of internationalization tools. The testing workflow introduces participants to the features of the commonly used extensions. Finding and tracking bugs in Bugzilla familiarizes them with the bug lifecycle and also provides an opportunity to work closely with the developers while the bugs are being fixed. Creating a patch to fix a bug is the next exciting step, one that new participants are always encouraged to take.

If you’d like to participate in testing, we now have a document that will help you get started with the manual tests. Alternatively, you could also help in writing the automated tests (using Cucumber and Ruby). The newest version of MLEB has been released and is now ready for download.

Runa Bhattacharjee
Outreach and QA coordinator, Language Engineering, Wikimedia Foundation

Restoring the forgotten Javanese script through Wikimedia

There are several confusing and surprising things about the Javanese language. First, a lot of people confuse it with Japanese, or with Java, a programming language. Also, with over eighty million speakers, it is one of the ten most widely spoken languages in the world, yet it is not an official language in any country or territory.

Illuminated manuscript of Babad Tanah Jawi (History of the Javanese Land) from the 19th century.

Javanese is mainly spoken in Indonesia, on the island of Java, which gave its name to a popular variety of coffee. The only official language of that country is Indonesian, but Javanese is the main spoken language in its area. It is used in business, politics and literature. In fact, its literary tradition goes back to the tenth century, when an encyclopedia-like work titled Cantaka Parwa was written in it. Another Javanese encyclopedia was published in the nineteenth century, titled Bauwarna.

This tradition is being continued today by Wikipedians who speak the language: every day they strive to improve and enhance the Javanese Wikipedia, which now has over forty thousand articles. One of them is Benny Lin. In addition to writing articles and explaining the Wikipedia mission to people, Benny’s special passion is making the Javanese language usable online not just in the more prevalent Latin alphabet but also in the ancient Javanese script.

This ancient script, also known as Carakan, was used for over a thousand years, and numerous books have been published in it. These days there is little book publishing in it, though it is still used in some textbooks, in some Facebook groups and on public signs. Elsewhere the Latin alphabet is used more frequently. The younger generation is starting to forget the old script, and this rich heritage is becoming inaccessible. Benny hopes that transcribing classical literature for Wikisource and writing modern encyclopedic articles in this script will revive interest in it, help the Javanese people achieve a greater understanding of their own culture, and make these largely unknown treasures of wisdom accessible to people of all languages and cultures.

Javanese Wikipedia article about Joko Widodo

Benny presented a talk about this at Wikimania in Hong Kong, the international gathering of Wikipedians. There he also worked with Santhosh Thottingal and me, developers from Wikimedia’s Language Engineering team, to improve the support for the Javanese script in Wikipedia. Thanks to this work, Wikipedias in all languages can now show text in the Javanese script, and readers don’t have to install any fonts on their computers, because the fonts are delivered using webfonts technology. The exquisite Javanese script has many ligatures and other special features, which require the Graphite technology for display. As of this writing, the only web browser that supports it is Firefox, but Graphite is Free Software, and it may become supported in other browsers in the near future.

Benny also completed his work on Javanese typing tools for Wikipedia, so the script can now not only be read but also written easily. This technology can even be used on sites other than Wikipedia, via the jquery.ime library.
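Enabling the typing tools on another site essentially means attaching jquery.ime to the relevant text fields. The snippet below is a minimal sketch and assumes the library and its input method files are already loaded on the page.

```javascript
// Minimal sketch: attach jquery.ime to text fields so readers can pick
// an input method (for instance, a Javanese transliteration scheme)
// from the selector it provides. Assumes jquery.ime and its input
// method files are already loaded on the page.
$( 'textarea, input[type="text"]' ).ime();
```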

He sees his work as part of a larger effort by many people who care about the script. There are others who design fonts, promote the script in different venues and research its literature. Benny saw that he could contribute by making the fonts and typing tools more accessible through Wikipedia, and he simply did it.

Wikimedians believe that the sum of all knowledge must be freely shared by all humans, and this means that it must also be shared in all languages. Passionate volunteers like Benny are the people who make this happen.

Amir E. Aharoni
Software Engineer, Language Engineering team

Join the Language Engineering team at Wikimania in Hong Kong next week!

Wikimania 2013 Hong Kong

Wikimania, the largest gathering of the Wikimedia world, is just around the corner. In less than a week, hundreds of volunteers will be gathering in Hong Kong to share stories and talk about grand plans for the year ahead. The Language Engineering team of the Wikimedia Foundation also plans to join in! The team has been working consistently on improving internationalization support in MediaWiki tools and in Wikimedia projects. At Wikimania 2013, the team will be present to talk and exchange ideas about these projects.

Team members will be presenting technology sessions, organizing a translation sprint and showcasing at the DevCamp. At the Translation sprint workshop, participants will get a quick introduction to using the Translate extension on Wikimedia wikis and translatewiki.net. The Translate extension provides MediaWiki with essential features needed to do translation work. It can be used to translate simple content, wiki user interfaces and system messages.

In an interesting design session, our team’s interaction designer Pau Giner will present on the challenges and complexities of designing language-conscious user interfaces. In the talk titled Improving the user experience of language tools, he will look at two significant projects developed by the team, the Universal Language Selector (ULS) and Translate UX, as case studies.

Team engineers Amir Aharoni and Niklas Laxström will discuss possibilities for enhancing multilingual handling of multimedia meta-information on Wikimedia Commons, in a talk titled Multilingual Wikimedia Commons – What can we do about it? To find an answer, they will touch upon the advanced features in Translate, ULS and other tools that can be used to translate images, templates, descriptions and other graphics metadata to make Commons a truly multilingual wiki.

A technical session, MediaWiki i18n getting data-driven and world-reusable, on improvements in MediaWiki internationalization (i18n) will be presented by Santhosh Thottingal and Niklas Laxström. They will talk about preferring a data-driven approach over custom code, which helps code maintainability across multiple i18n frameworks. An example is the use of data from the Unicode Common Locale Data Repository (CLDR). They will also speak about the challenges and benefits of maintaining two code bases, MediaWiki’s JavaScript i18n and the jquery.i18n library, for different developer audiences.

Language team members will take part in panel discussions covering WMF Agile practices and in the Ask the Developers session. Come meet us at any of these sessions and bring your toughest language software questions along!

Complete list of Language Engineering sessions.

Runa Bhattacharjee
Outreach and QA coordinator, Language Engineering, Wikimedia Foundation

Updates from the Language Engineering Google Summer of Code projects

Google Summer of Code (GSoC) 2013 is well underway in its coding phase. Four projects this year relate to various aspects of MediaWiki and Wikimedia internationalization initiatives. On completion of these four projects, we expect to present the results outlined in the sections below.

These projects are being mentored by members of the Wikimedia Language Engineering team along with members from the WMF Mobile and VisualEditor teams. In this post, we touch base with each of the projects about the challenges they have faced so far and their accomplishments.

MediaWiki VisualEditor internationalization and right-to-left languages support

A screenshot of a draft version of the VisualEditor language inspector.

Moriel Schottlender is working on adding better support for non-English languages to the VisualEditor. Her project consists of two main parts. The first is triaging and fixing bugs in handling right-to-left text reported by volunteer editors who test the currently deployed version of the editor in languages such as Arabic and Hebrew. She has already fixed several bugs, such as moving the cursor in the correct direction using the arrow keys and adapting the design of VisualEditor’s dialog boxes to right-to-left layout. The other part of the project is developing a “language inspector”, a tool for setting the language of a piece of text in an article. This is needed very frequently in Wikipedias in all languages to set properties such as the font, size or direction of a foreign name or quotation. Nowadays this is done using a multitude of templates and HTML tags, and Moriel’s project will make it easy and unified.
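Under the hood, marking a piece of text as being in another language comes down to annotating it with the appropriate lang and dir attributes, which is what the language inspector will manage for editors. The sketch below illustrates the kind of markup involved; it is not VisualEditor code.

```javascript
// Illustrative sketch (not VisualEditor code): annotating an inline
// piece of foreign text with its language and direction, which is the
// kind of markup the language inspector manages for the editor.
function wrapForeignText( text, langCode, direction ) {
	var span = document.createElement( 'span' );
	span.lang = langCode;  // e.g. 'he' for Hebrew
	span.dir = direction;  // 'rtl' or 'ltr'
	span.textContent = text;
	return span;
}

// Usage: embed a Hebrew name inside an otherwise left-to-right article.
document.body.appendChild( wrapForeignText( 'ויקיפדיה', 'he', 'rtl' ) );
```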

jQuery.IME extensions for Firefox and Chrome

Part of Project Milkshake, jQuery.ime is an input method library. Making a good start, Praveen Singh has already implemented working jQuery.ime extensions for both Chrome and Firefox. Rather than loading all input methods at once, input method scripts are loaded only when the user selects a particular input method. The jQuery.ime and jQuery.uls upstream projects were added as git submodules in the extensions. This will provide a way to synchronize the extensions with the upstream projects in the future by simply updating the respective submodule. Universal Language Selector (ULS) has been successfully integrated in the extensions, thus providing the users with an easy way to choose among different languages. The extension remembers a user’s most recently selected languages and their corresponding input methods, and offers an easy way to choose among those languages.

Language Coverage Matrix Dashboard

The Language Coverage Matrix Dashboard was a project conceived during the last Open Source Language Summit, held in Pune, India earlier this year. The project began as a shared spreadsheet, and over the next few months the data was refined to uniquely identify the internationalization support status of each language and its variants. The dashboard will provide an interface to search this data and present visual representations of it. The test instance set up by Harsh Kothari on wmflabs showcases the search features. The spreadsheet data has been ported to a MySQL database. Plans for the coming few weeks include implementing additional queries, refactoring code and starting on the visualizations.

Phone app for MediaWiki translation

The Translate extension is used for translating MediaWiki content. This project will bring the convenience of this widely used extension to an Android app. The iPhone app, created by the student Or Sagi, is already available for download and testing. The similarly designed Android app will provide features for translation and proofreading. Due to conflicts with examination schedules, major work on this project will effectively start in late July.

Getting more updates

More details about the progress of each of these projects can be found on the project home pages. The students and mentors also meet up every week for demos. Please let them know if you’d like to be part of any of these sessions.

Runa Bhattacharjee
Outreach and QA coordinator, Language Engineering, Wikimedia Foundation