Wikimedia blog

News from the Wikimedia Foundation and about the Wikimedia movement


First Look at the Content Translation tool

The projects in the Wikimedia universe can be accessed and used in a large number of languages from around the world. The Wikimedia websites, their MediaWiki software (both core and extensions) and their growing content benefit from standards-driven internationalization and localization engineering that makes the sites easy to use in every language across diverse platforms, both desktop and mobile.

However, a wide disparity exists in the number of articles across language wikis. The article count across Wikipedias in different languages is an often-cited example. As the Wikimedia Foundation focuses on the larger mission of enabling editor engagement around the globe, the Wikimedia Language Engineering team has been working on a content translation tool that can greatly facilitate the process of article creation by new editors.

About the Tool


The Content Translation editor displaying a translation of the article for Aeroplane from Spanish to Catalan.

Particularly aimed at users fluent in two or more languages, the Content Translation tool has been in development since the beginning of 2014. It will provide a combination of editing and translation tools that can be used by multilingual users to bootstrap articles in a new language by translating an existing article from another language. The Content Translation tool has been designed to address basic templates, references and links found in Wikipedia articles.

Development of this tool has involved significant research and evaluation by the engineering team to handle elements like sentence segmentation, machine translation, rich-text editing, user interface design and scalable backend architecture. The first milestone for the tool’s rollout this month includes a comprehensive editor and limited capabilities for machine translation, link and reference adaptation, and dictionary support.

Why Spanish and Catalan as the first language pair?

Presently deployed at http://es.wikipedia.beta.wmflabs.org/wiki/Especial:ContentTranslation, the tool is open for wider testing and user feedback. Users will have to create an account on this wiki and log in to use the tool. For the current release, machine translation can only be used to translate articles between Spanish and Catalan. This language pair was chosen for the two languages’ linguistic similarity as well as the availability of well-supported language aids like dictionaries and machine translation. Driven by a passionate community of contributors, the Catalan Wikipedia is an ideal medium-sized project for testing and feedback. We also hope to enhance the aided translation capabilities of the tool by generating parallel corpora of text from within the tool.

To view Content Translation in action, please follow the link to this instance and make the following selections:

  • article name – the article you would like to translate
  • source language – the language in which the article you wish to translate exists (restricted to Spanish at this moment)
  • target language – the language into which you would like to translate the article (restricted to Catalan at this moment)

This will lead you to the editing interface, where you can provide a title for the page, translate the different sections of the article and then publish the page in your user namespace on the same wiki. This newly created page will then have to be copied over to the Wikipedia in the target language you selected earlier.

Users in languages other than Spanish and Catalan can also view the functionality of the tool by making a few tweaks.

We care about your feedback

Please share your feedback on this page on the Catalan Wikipedia or at this topic on the project’s talk page. We will attempt to respond as soon as possible, based on the criticality of the issues raised.

Runa Bhattacharjee, Outreach and QA coordinator, Language Engineering, Wikimedia Foundation

Translatewiki.net in the Swedish spotlight


Translatewiki.net’s logo.

Most Swedes have a basic understanding of English, but many of them are far from fluent. Hence, it is important that computer programs are localized so that they also work in Swedish and other languages. This helps people avoid mistakes and lets users work faster and more efficiently. But how is this done?

First and foremost, the different messages in the software need to be translated separately. Getting the translation just right and making sure that the language is consistent requires a lot of thought. In open source software, this work is often done by volunteers who double-check each other’s work. This allows a program to be translated into hundreds of different languages, including minority languages that commercial operators usually do not focus on. As an example, the MediaWiki software that is used in all Wikimedia projects (such as Wikipedia) is translated in this way. As MediaWiki is developed at a rapid pace, with a large number of new messages each month, it is important for us to have a large and active community of translators. This way we make sure that everything works in all languages as quickly as possible. But what could the Wikimedia movement do to help build this translator community?

We are happy to announce that Wikimedia Sverige is about to start a new project with support from Internetfonden (.Se) (the Internet Fund). The Internet Fund supports projects that improve the Internet’s infrastructure. The idea of translating open software to help build the translator community is in line with their goals. We gave the project a zingy name: “Expanding the translatewiki.net – ‘Improved Swedish localization of open source, for easier online participation’.” This is the first time that Wikimedia Sverige has had a project that focuses on this important element of the user experience. Here we will learn many new things that we will try to share with the wider community while aiming to improve the basic infrastructure on translatewiki.net. The translation platform translatewiki.net currently has 27 programs ready to be translated into 213 languages by more than 6,400 volunteers from around the world.


Odia language gets a new Unicode font converter

Screenshot mock-up of Akruti Sarala – Unicode Odia converter

It’s been over a decade since the Unicode standard was made available for the Odia script. Odia is a language spoken by roughly 33 million people in Eastern India, and is one of the many official languages of India. Since then, it has been challenging to get more content into Unicode, because many people who are used to other, non-Unicode standards are not willing to make the move. This created the need for a simple converter that could turn text typed in various non-Unicode fonts into Unicode. Such a converter could enrich Wikipedia and other Wikimedia projects by converting previously typed content and making it more widely available on the internet. The Odia language recently got such a converter, making it possible to convert text in two of the fonts most popular among media professionals (AkrutiOriSarala99 and AkrutiOriSarala) into Unicode.

All of the non-Latin scripts came under one umbrella with the rollout of Unicode. Since then, many Unicode-compliant fonts have been designed, and the open source community has put effort into producing good quality fonts. Though contributions to Unicode-compliant portals like Wikipedia increased, the publication and printing industries in India were still stuck with the pre-existing ASCII and ISCII standards (ISCII being an Indian font encoding standard based on ASCII). Modified ASCII fonts used as typesets for newspapers, books, magazines and other printed documents still exist in these industries. This created a massive amount of content that is not searchable or reproducible because it is not Unicode compliant.

The difference with a Unicode font is that it has separate code points and glyphs for the Indic script characters, whereas the legacy ASCII fonts simply replace the Latin glyphs with Indic ones. So, when someone does not have a particular ASCII-standard font installed, the typed text looks absurd (see Mojibake), whereas text typed using one Unicode font can be read using another Unicode font on a different operating system. Most of the ASCII fonts used for typing Indic languages are proprietary, and many individuals and organizations even use pirated software and fonts.

Having massive amounts of content available in multiple standards and little content in Unicode created a large gap for many languages, including Odia. Until all of this content is converted to Unicode to make it searchable, sharable and reusable, the knowledge base it represents will remain inaccessible. Fortunately, some of the Indic languages have more and more contributors creating Unicode content. There is a need to work on technological development to convert non-Unicode content to Unicode and open it up for people to use.
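To give a flavour of how such a converter works, here is a minimal JavaScript sketch of a table-driven legacy-font-to-Unicode converter. The mapping entries are placeholders, not the actual AkrutiOriSarala glyph table, and a real converter also needs reordering rules for vowel signs and conjuncts.

// Minimal sketch of a table-driven legacy-font-to-Unicode converter.
// The mapping below is a placeholder: a real converter for AkrutiOriSarala
// needs the full glyph-to-codepoint table plus matra reordering rules.
const legacyToUnicode = {
  // 'character as stored in the ASCII-hack font': 'Unicode Odia text'
  'k': '\u0B15',   // example only: one legacy code mapped to ORIYA LETTER KA
  'Q': '\u0B16'    // example only
};

function convertLegacyText(text) {
  // Longest-match-first conversion, since legacy fonts often map ligatures
  // to sequences of legacy codes.
  const keys = Object.keys(legacyToUnicode).sort((a, b) => b.length - a.length);
  let out = '';
  let i = 0;
  while (i < text.length) {
    const match = keys.find((k) => text.startsWith(k, i));
    if (match) {
      out += legacyToUnicode[match];
      i += match.length;
    } else {
      out += text[i];   // pass through anything we do not recognise
      i += 1;
    }
  }
  return out;
}

console.log(convertLegacyText('kQ'));   // placeholder input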


VisualEditor gadgets

This post was written by two recipients of Individual Engagement Grants. These grants are awarded by the Wikimedia Foundation and aim to support Wikimedians in completing projects that benefit the Wikimedia movement. The grantees of this project work independently from the Foundation in the creation of their project.

Directionality tool: an example of a useful site-specific button added to VisualEditor, which inserts an RTL mark.

Many gadgets and scripts have been created by volunteers across Wikimedia projects, and many of them are intended to improve the editing experience. For the past few months there has been a new VisualEditor interface for editing articles. The interface is still in “beta,” so Wikipedians have not yet adopted it on a large scale. We believe there are many missing features that, if incorporated, could expand the VisualEditor user base. The known unsupported features are core features and extension features (such as timelines), but there are many unknown unsupported features: gadgets. Gadgets can extend and customize VisualEditor and introduce new functionality: letting more advanced users use more features (such as timelines), introducing workflows that are project-specific (such as deletion proposals), or making it easy to insert popular templates such as those for citing sources. Since there is no central repository for gadgets, there is no easy way to tell what gadgets exist across all wikis.

Our project aims to organize this mess: to improve gadget sharing among communities and to help bring gadget improvements from the editing interface to VisualEditor. As part of this project we have already:

  • Mapped all the gadgets (in any language) and created a list of the gadgets in various projects, with popularity ratings across projects.
  • Based on this list, we selected key gadgets and rewrote them to support the new VisualEditor:
    • Spell checker (Rechtschreibpruefung) – spell checking for common errors. Spelling mistakes are highlighted in red while writing!
    • Reftoolbar – helps editors add citation templates to articles.
    • Directionality tool – adds a button that inserts an RTL mark, useful in right-to-left languages such as Arabic and Hebrew.
    • Common summaries – adds two new drop-down boxes below the edit summary box in the save dialog, with some useful default summaries.
  • Based on our experience with writing VE gadgets, we created a guide for VE gadget writers, which should help them extend VisualEditor with custom features. We hope it helps develop support for VisualEditor by making it more integrated with existing tools (a minimal sketch of such a gadget follows below).
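For readers curious what a VisualEditor-aware gadget can look like, here is a minimal, hedged sketch. The mw.hook( 've.activationComplete' ) event is the documented way to run code once the editor is ready; the surface and fragment calls follow VisualEditor’s public API, but the toolbar wiring is omitted and the helper name is purely illustrative.

// Minimal sketch of a gadget that reacts to VisualEditor being opened.
// Assumes it runs as a user script or gadget on a wiki where VE is enabled,
// so the global mw and ve objects are available.
mw.hook( 've.activationComplete' ).add( function () {
	var surface = ve.init.target.getSurface();

	// Illustrative action: insert a Right-To-Left Mark (U+200F) at the cursor,
	// the core behaviour of a "directionality" gadget for Arabic or Hebrew wikis.
	function insertRlm() {
		surface.getModel().getFragment().insertContent( '\u200F' );
	}

	// How this gets exposed (toolbar button, menu entry, keyboard shortcut)
	// depends on the gadget; see the guide for VE gadget writers mentioned above.
	window.insertRlmExample = insertRlm; // hypothetical hook-up, for testing only
} );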

 


MediaWiki localization file format changed from PHP to JSON

Translations of MediaWiki’s user interface are now stored in a new file format—JSON. This change won’t have a direct effect on readers and editors of Wikimedia projects, but it makes MediaWiki more robust and open to change and reuse.

MediaWiki is one of the most internationalized open source projects. MediaWiki localization includes translating over 3,000 messages (interface strings) for MediaWiki core and an additional 20,000 messages for MediaWiki extensions and related mobile applications.

User interface messages, originally in English, and their translations have historically been stored in PHP files along with the MediaWiki code. New messages and their documentation were added in English, and these messages were translated on translatewiki.net into over 300 languages. These translations were then pulled by MediaWiki websites using LocalisationUpdate, an extension that MediaWiki sites use to receive translation updates.

So why change the file format?

The motivation to change the file format was driven by the need to provide more security, reduce localization file sizes and support interoperability.

Security: PHP files are executable code, so the risk of malicious code being injected is significant. In contrast, JSON files are only data, which minimizes this risk.

Reducing file size: Some of the larger extensions have had multi-megabyte localization data files. Editing those files was becoming a management nightmare for developers, so the data was split into one file per language instead of storing all languages in a few large files.

Interoperability: The new format increases interoperability by allowing features like VisualEditor and the Universal Language Selector to be decoupled from MediaWiki, because JSON files can be used without MediaWiki. This was demonstrated earlier with the jquery.i18n library. This library, developed by Wikimedia’s Language Engineering team in 2012, provides internationalization features very similar to what MediaWiki offers, but it is written fully in JavaScript and stores messages and message translations in JSON format. With LocalisationUpdate’s modernization, MediaWiki localization files are now compatible with those used by jquery.i18n.
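To make the shared format concrete, here is a hedged sketch of a MediaWiki-style JSON message object being loaded with jquery.i18n. The "@metadata" block and flat key-value layout reflect the new file format; the message keys themselves and the loader call details are illustrative, so check the jquery.i18n documentation before relying on them.

// A message object in the shape of a MediaWiki i18n JSON file (illustrative keys).
var messagesEn = {
	'@metadata': { authors: [ 'Example Author' ] },
	'example-greeting': 'Hello, $1!',
	'example-edit-count': 'You have made $1 {{PLURAL:$1|edit|edits}}.'
};

// Loading and using it with the jquery.i18n library (assumed to be on the page).
$.i18n().locale = 'en';
$.i18n().load( messagesEn, 'en' ).done( function () {
	console.log( $.i18n( 'example-greeting', 'world' ) );   // "Hello, world!"
	console.log( $.i18n( 'example-edit-count', 3 ) );       // "You have made 3 edits."
} );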

An RFC on this topic was compiled and accepted by the developer community. In late 2013, developers from the Language Engineering and VisualEditor teams at Wikimedia collaborated to figure out how MediaWiki could best process messages from JSON files. They wrote a script for converting PHP to JSON, made sure that MediaWiki’s localization cache worked with JSON, and updated the LocalisationUpdate extension for JSON support.

Siebrand Mazeland converted all the extensions to the new format. This project was completed in early April 2014, when MediaWiki core switched over to processing JSON, creating the largest MediaWiki patch ever in terms of lines of code. The localization formats are documented on mediawiki.org, and MediaWiki’s general localization guidelines have been updated as well.

As a side effect, code analyzers like Ohloh no longer report skewed numbers for lines of PHP code, making metrics like comment ratio comparable with other projects.

Work is in progress on migrating other localized strings, such as namespace names and MediaWiki magic words. These will be addressed in a future RFC.

This migration project exemplifies collaboration at its best among the many MediaWiki engineers who contributed. I would like to specially mention Adam Wight, Antoine Musso, David Chan, Ed Sanders, Federico Leva, James Forrester, Jon Robson, Kartik Mistry, Niklas Laxström, Raimond Spekking, Roan Kattouw, Rob Moen, Sam Reed, Santhosh Thottingal, Siebrand Mazeland and Timo Tijhof.

Amir Aharoni, Interim PO and Software Engineer, Wikimedia Language Engineering Team

Modernising MediaWiki’s Localisation Update

Interface messages on MediaWiki and its many extensions are translated into more than 350 languages on translatewiki.net. Thousands of translations are created or updated each day. Usually, users of a wiki would have to wait until a new version of MediaWiki or of an extension is released to see these updated translations. However, webmasters can use the LocalisationUpdate extension to fetch and apply these translations daily without having to update the source code.

LocalisationUpdate provides a command line script to fetch updated translations. It can be run manually, but usually it is configured to run automatically using cron jobs. The sequence of events that the script follows is:

  1. Gather a list of all localisation files that are in use on the wiki.
  2. Fetch the latest localisation files from either:
    • an online source code repository, using https, or
    • clones of the repositories in the local file system.
  3. Check whether English strings have changed to skip incompatible updates.
  4. Compare all translations in all languages to find updated and new translations.
  5. Store the translations in separate localisation files.

MediaWiki’s localisation cache automatically finds the new translations via a hook that the LocalisationUpdate extension subscribes to.
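As a rough illustration of the flow above, here is a simplified Node.js sketch that fetches remote JSON message files, skips messages whose English source strings have changed, keeps only new or updated translations and stores them separately. The repository URL and directory layout are placeholders, not LocalisationUpdate’s actual configuration, and the real extension is written in PHP.

// Simplified sketch of the LocalisationUpdate flow described above.
// Assumes Node.js 18+ (global fetch) and placeholder paths/URLs.
const fs = require('fs');

async function fetchJson(url) {
  const res = await fetch(url);
  if (!res.ok) throw new Error('Failed to fetch ' + url);
  return res.json();
}

async function updateLanguage(baseUrl, lang, localDir) {
  const remoteEn = await fetchJson(baseUrl + '/en.json');
  const localEn = JSON.parse(fs.readFileSync(localDir + '/en.json', 'utf8'));
  const remote = await fetchJson(baseUrl + '/' + lang + '.json');
  const local = JSON.parse(fs.readFileSync(localDir + '/' + lang + '.json', 'utf8'));

  const updated = {};
  for (const [key, value] of Object.entries(remote)) {
    if (key === '@metadata') continue;
    // Skip messages whose English source changed: the newer translation may
    // not match the meaning of the English string this wiki still uses.
    if (remoteEn[key] !== localEn[key]) continue;
    if (local[key] !== value) updated[key] = value;   // new or changed translation
  }

  // Store the updates in a separate file for the localisation cache to pick up.
  fs.mkdirSync(localDir + '/updates', { recursive: true });
  fs.writeFileSync(
    localDir + '/updates/' + lang + '.json',
    JSON.stringify(updated, null, '\t')
  );
}

updateLanguage('https://example.org/i18n', 'fi', './i18n').catch(console.error);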

Until very recently the localisation files were in PHP format. They have now been converted to JSON, and this update required changes in LocalisationUpdate to handle JSON files. Extending the code piecemeal over the years had made the code base tough to maintain, so it has been rewritten for extensibility, to support future development while retaining adequate support for older MediaWiki versions that use this extension.

The rewrite did not add any new features except support for JSON format. The code for the existing functionality was refactored using modern development patterns such as separation of concerns and dependency injection. Unit tests were added as well.

The configuration format for the update scripts changed, but most webmasters won’t need to change anything, and will be able to use the default settings. Changes will be needed only on sites that for some reason don’t use the default repositories.

New features are being planned for future versions that would optimise LocalisationUpdate to run faster and without any manual configuration. Currently, the client downloads the latest translations for all extensions in all languages and then compares which translations can be updated. By moving some of the complex processing to a separate web service, the client can save bandwidth by downloading only updated messages for specific updated languages used by the reader.

There are still more things to improve in LocalisationUpdate. If you are a developer or a webmaster of a MediaWiki site, please join us in shaping the future of this tool.

Niklas Laxström and Runa Bhattacharjee, Language Engineering, Wikimedia Foundation

Hovercards now available as a Beta Feature on all Wikimedia wikis


Hovercards presents a summary of an article when you hover over a link to it.

Have you ever quickly looked up an article on Wikipedia, gotten all caught up in the fun of reading, then suddenly realized that three hours had passed and you’re now reading a series of articles about totally different topics? The folks at xkcd certainly have. And now, we’ve got just the feature for you.

Hovercards provides the casual reader with a streamlined browsing experience. Whenever you hover over a link to another article, a short summary of the article and a relevant image are shown, so you can decide whether you want to read the full article. And, as of today, the feature is live for testing as a Beta Feature on all Wikimedia wikis.

Inspired by the Navigation Popups (NavPopups) gadget used by many of our experienced editors, Hovercards takes the idea of NavPopups and modifies it to suit casual readers. The design is minimalistic, presenting only the information that interests casual readers: the lead paragraph and first image of the article they are about to browse to.
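For the technically curious, the underlying idea can be sketched against the public MediaWiki API, which can return a plain-text lead section via the TextExtracts extension. This is not the Hovercards extension’s own code; the link selector and data attribute below are purely illustrative.

// Hedged sketch: fetch a lead-paragraph summary for a hovercard-style preview
// using the MediaWiki API (TextExtracts). Illustration only, not Hovercards itself.
function fetchSummary(title) {
  const params = new URLSearchParams({
    action: 'query',
    prop: 'extracts',
    exintro: '1',        // only the lead section
    explaintext: '1',    // plain text, no HTML
    titles: title,
    format: 'json',
    origin: '*'          // allow anonymous cross-origin requests
  });
  return fetch('https://en.wikipedia.org/w/api.php?' + params)
    .then((res) => res.json())
    .then((data) => {
      const pages = data.query.pages;
      return pages[Object.keys(pages)[0]].extract;
    });
}

// Usage: log a preview when the reader hovers over a link.
// 'a[data-article]' is a hypothetical selector for this sketch.
document.querySelectorAll('a[data-article]').forEach((link) => {
  link.addEventListener('mouseover', () => {
    fetchSummary(link.dataset.article).then((summary) => console.log(summary));
  });
});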

To enable Hovercards, simply log in, click the “Beta” link at the top right of your page and tick the box next to Hovercards. As its placement in the Beta tab suggests, the feature is still under active development, so we’d appreciate any feedback you have. You can give us feedback by writing on the talk page of the feature.

We hope you like using Hovercards!

Vibha Bamba, Dan Garry, Prateek Saxena, Nick Wilson
Wikimedia Foundation

Webfonts: Making Wikimedia projects readable for everyone

Wikimedia wikis are available in nearly 300 languages, and some of them have pages with mixed-script content. An example is the page on the writing systems of India on the English Wikipedia. We expect users to be able to view this page in full and not see meaningless squares, also known as tofu. These tofu squares stand in for letters that cannot be rendered by the web browser on the reader’s device. This can happen for several reasons:

  • The device does not have a font for the particular script;
  • The operating system or the web browser does not support the technology needed to render the characters;
  • The operating system or the browser supports the script only partially. For instance, because characters have been added to several scripts in recent Unicode versions, older existing fonts may not support the new characters.

Fonts for most languages written in the Latin script are widely available on a variety of devices. However, languages written in other scripts often face obstacles when fonts on operating systems are unavailable, outdated, bug-ridden or aesthetically sub-optimal for reading content.

Using Webfonts with MediaWiki

To alleviate these shortcomings, the WebFonts extension was developed and deployed to some wikis in December 2011. The underlying technology downloads fonts automatically if they are not present on the reader’s device, similar to how images in web pages are downloaded.
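The general mechanism can be illustrated with the standard CSS Font Loading API. The MediaWiki extensions achieve a similar effect through dynamically applied @font-face rules in jquery.webfonts; the font name and URL in this sketch are placeholders.

// Hedged sketch of on-demand font delivery using the standard FontFace API.
// The font family name and URL are placeholders, not real Wikimedia assets.
const exampleFont = new FontFace(
  'ExampleOdiaFont',
  'url(https://example.org/fonts/example-odia.woff)'
);

exampleFont.load().then((loaded) => {
  document.fonts.add(loaded);                  // make the downloaded font available
  document.documentElement.style.fontFamily =
    "'ExampleOdiaFont', sans-serif";           // apply it as the page default
}).catch(() => {
  // Fall back silently to whatever local fonts exist.
});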

The old WebFonts extension was converted to the jquery.webfonts library, which was included in the Universal Language Selector, the extension that replaced the old WebFonts extension. Webfonts are applied using the jquery.webfonts library, and on Wikimedia wikis it is configured to use the fonts in the MediaWiki repository. The two important questions we need answered before this can be done are:

  1. Will the user need webfonts?
  2. If yes, which one(s)?

Webfonts are provided when:

  • Users have chosen to use webfonts in their user preferences.
  • The font is explicitly selected in CSS.
  • Users viewing content in a particular language do not have the fonts on their local devices, or the devices do not display the characters correctly, and the language has an associated default font that can be used instead. Before the webfonts are downloaded, a test currently known as “tofu detection” is done to ascertain that the local fonts are indeed not usable. The default fonts are chosen by the user community.

Webfonts are not applied:

  • when users choose not to use webfonts, even if there is a valid reason to use them (see above);
  • in the text edit area of the page, where the user’s preference or browser settings are honored.

See image (below) for a graphical description of the process.

‘Tofu’ Detection

The font to be applied is chosen either by the name of the font family or, if the designated font family is not available, according to the language; in the latter case the language’s default font takes priority. However, negotiating more complex selection options like font inheritance and fallback adds to the challenge. For projects like Wikimedia, selecting appropriate fonts for inclusion is also a concern. The many challenges include the absence of well-maintained fonts, the limited number of freely licensed fonts and the rejection of fonts by users for being sub-optimal.
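One common heuristic for tofu detection is a width comparison: render a sample character and compare its measured width against a codepoint that no font defines, so that both fall back to the placeholder glyph if the sample is unsupported. The sketch below illustrates that idea; the actual check in ULS/jquery.webfonts differs in detail.

// Hedged sketch of a width-based "tofu" check.
function rendersAsTofu(sampleChar) {
  const canvas = document.createElement('canvas');
  const ctx = canvas.getContext('2d');
  ctx.font = '32px sans-serif';
  const sampleWidth = ctx.measureText(sampleChar).width;
  const tofuWidth = ctx.measureText('\uFFFE').width;   // a codepoint no font defines
  // If both render at the same width, the sample is almost certainly a tofu box.
  return sampleWidth === tofuWidth;
}

// Example: decide whether to download a webfont for Odia text.
if (rendersAsTofu('\u0B15')) {   // ORIYA LETTER KA
  // trigger the webfont download for the language's default font
}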

Challenges to Webfonts

Merely serving the webfont is not the only challenge this technology faces. The complexities are compounded for languages of South and South-East Asia, as well as the Ethiopic script and a few other scripts with nascent internationalization support. Font rendering and support for these scripts vary across operating system platforms. The inconsistency can stem from the technology used, such as the rendering engines, which can produce widely different results across browsers and operating systems. Santhosh Thottingal, senior engineer on Wikimedia’s Language Engineering team, who has been participating in recent developments to make webfonts more efficient, outlines this in greater detail.

Checkbox in the Universal Language Selector preferences to download webfonts

A major impact is on bandwidth consumption and page load time, due to the additional overhead of delivering webfonts to millions of users. A recent response to this challenge was a change introduced in the Universal Language Selector (ULS) to prevent pages from loading slowly, particularly where bandwidth is a premium commodity. A checkbox now allows users to choose whether they would like webfonts to be downloaded.

Implementing Webfonts

Several clever solutions are currently in use to work around the known challenges. The webfonts are prepared with the aim of keeping their footprint comparatively small. For instance, Google’s sfntly tool, which uses MicroType Express compression, is used to create fonts in the EOT format (WOFF being the other widely used webfont format). However, the inherent demands of scripts with larger character sets cannot always be overcome efficiently. Caches are used to reduce unnecessary webfont downloads.

FOUT, or Flash Of Unstyled Text, is an unavoidable consequence of the browser displaying text in dissimilar styling, or no text at all, while waiting for the webfonts to load. Different web browsers handle this differently, and optimizations are in the making. A possible solution in the near future may be the in-development WOFF2 webfont format, which is expected to further reduce font sizes and improve performance and font-loading behaviour.

Special fonts like the Autonym font are used in places where known text, like a list of language names, needs to be displayed in multiple scripts. The font carries only the characters necessary to display the predefined content.

Additional optimizations at this point are directed towards improving the performance of the JavaScript libraries that are used.

Conclusion

Several technical solutions are being explored within Wikimedia Language Engineering and in collaboration with organizations with similar interests. Wikipedia’s sister project Wikisource attempts to digitize and preserve copyright-expired literature, some of which is written in ancient scripts. In these as well as other cases like accessibility support, webfonts technology allows fonts for special needs to be made available for wider use. The clear goal is to have readable text available for all users irrespective of the language, script, device, platform, bandwidth, content and special needs.

For more information on implementing webfonts in MediaWiki, we encourage you to read and contribute to the technical document on mediawiki.org.

Runa Bhattacharjee, Outreach and QA coordinator, Language Engineering, Wikimedia Foundation

OpenDyslexic font now available on Polish Wikipedia


Screenshot of selecting the OpenDyslexic font

For those who suffer from dyslexia, the simple task of reading can become a monumental struggle. It can be hard for those who do not suffer from dyslexia to understand exactly what it means to have it; for this reason, the condition can often go unaddressed. Fortunately there is hope in the form of the OpenDyslexic font.

With so much reading being done on computer screens, it is finally possible to help individuals with dyslexia. The OpenDyslexic font changes the shape of characters enough to make reading a lot easier for those who suffer from dyslexia.

Wikipedia supports OpenDyslexic for many languages, but until recently not for Polish. At the CEE conference in Modra, Slovakia, we learned that Polish could be supported as well. The request to enable OpenDyslexic was quickly granted and it is now fully supported. We would like to celebrate this occasion with the larger dyslexic community, who we hope will benefit from this new feature on Polish Wikipedia.

Gerard Meijssen, Wikimedian


Language Engineering Events – Language Summit, Fall 2013

The Wikimedia Language Engineering team, along with Red Hat, organised the Fall edition of the Open Source Language Summit in Pune, India on November 18 and 19, 2013.

Members from the Language Engineering, Mobile, VisualEditor, and Design teams of the Wikimedia Foundation joined participants from Red Hat, Google, Adobe, Microsoft Research, Indic language projects and open source projects (Fedora, Debian), as well as Wikipedians from various Indian languages. Google Summer of Code interns for Wikimedia language projects were also present. The two-day event was organised as work sessions focussed on fonts, input tools, content translation and language support on desktop, web and mobile platforms.

Participants at the Open Source Language Summit, Pune, India

The Fontbook project, started during the Language Summit earlier this year, is slated to be extended to 8 more Indian languages. The project aims to create a technical specification for Indic fonts based upon the OpenType v1.6 specification. Pravin Satpute and Sneha Kore of Red Hat presented their work on the next version of the Lohit font family, based upon the same specification and using Harfbuzz-ng. It is expected that this effort will complement the Fontbook project’s goals.

The other font sessions included a walkthrough of the Autonym font created by Santhosh Thottingal, a Q&A session by Behdad Esfahbod about the state of Indic font rendering through Harfbuzz-ng, and a session to package webfonts for Debian and Fedora for native support. Learn more about the font sessions.

Improving the input tools for multilingual input on the VisualEditor was extensively discussed. David Chan walked through the event logger system built for capturing IME input events, which is being used as an automated IME testing framework available at http://tinyurl.com/imelog to build a library of similar events across IMEs, OSs and languages.

Santhosh Thottingal stepped through several tough use cases of handling multilingual input, to support VisualEditor’s need to provide non-native support for handling language content blocks within the contentEditable surface. Wikipedians from various Indic languages also provided their input. On-screen keyboards, mobile input methods like LiteratIM and predictive typing methods like ibus-typing-booster (available for Fedora) were also discussed. Read more about the input method sessions.

The Language Coverage Matrix Dashboard, which displays language support status for all languages in Wikimedia projects, was showcased. The Fedora Internationalization team, which currently provides resources for fewer languages than the Wikimedia projects, will identify the gap using the LCMD data and assess the resources that can be leveraged to enhance support on desktops. Dr. Kalika Bali from Microsoft Research Labs presented on leveraging content translation platforms for Indian languages and highlighted that machine translation for Indic languages could be improved significantly by using web-scale content like Wikipedia.

Learn more about the sessions, accomplishments and next steps for these projects from the Event Report.

Runa Bhattacharjee, Outreach and QA coordinator, Language Engineering, Wikimedia Foundation