Bashkir becomes the first language collated inside MediaWiki

Photo by Visem, CC BY-SA 4.0.

Have you ever heard of Bashkortostan?
It’s a region of Russia, about 1,000 miles from Moscow. Few people outside Russia have heard of it, but inside Russia it’s well known for its traditional honey and kumis industries, and for the rivers, forests, and mountains that draw tourists.
The region’s name comes from the Bashkirs, a distinct ethnic group that lives there and speaks its own language, which belongs to the Turkic family. Twelve years ago, the first article in the Bashkir-language Wikipedia was written. Today the community of editors around it is among the most active Wikipedia communities in the languages of Russia.
That community recently asked the MediaWiki software developers to solve a technical problem for them: category collation in the Bashkir alphabet. Put simply, “collation” is the process of sorting words according to the alphabet. It’s not as simple as it may sound, and it works slightly differently in every language.

Photo by Amir Aharoni, public domain.

Bashkir is written in the Cyrillic alphabet, like Russian, but with several additional letters for sounds specific to Bashkir. These letters have their own places throughout the alphabet, but MediaWiki sorted all of them incorrectly. For example, in the “Capitals of republics of Russia” category, the entry for Ufa (Өфө), Bashkortostan’s capital, appeared at the end of the list, even though it belongs in the middle.
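The misplacement of Өфө is consistent with what happens when entries are compared by raw Unicode code point, because the extra Bashkir letters are encoded after the basic Cyrillic letters. The short Python sketch below illustrates that effect. It is not MediaWiki's code, and the sample entries other than Өфө are chosen only for illustration.

```python
# Illustrative category entries; only "Өфө" (Ufa) comes from the article.
entries = ["Якутск", "Өфө", "Абакан"]

# Python compares strings by Unicode code point by default.
by_code_point = sorted(entries)

for name in by_code_point:
    first = name[0]
    # Show each entry with the code point of its first letter.
    print(name, f"U+{ord(first):04X}")
```

Because Ө (U+04E8) has a higher code point than Я (U+042F), the last letter of the alphabet, Өфө lands after every entry that starts with a basic Cyrillic letter.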
MediaWiki relies on an external library called ICU (International Components for Unicode) to apply collation for different languages. ICU has collation data for many languages, but not all of them, and Bashkir is not among them.
I submitted a request to get this language into ICU, but the process of getting a new language into it can take many months, if not years. We could have just waited for that to happen, but then our colleague Brian Wolff wrote some brilliant code that resolves this issue inside MediaWiki’s code, making it unnecessary to wait for ICU to be updated.
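This post does not describe how the fix works internally. As a rough idea of what language-specific collation in application code can look like, here is a minimal Python sketch that ranks each letter by its position in the Bashkir alphabet. The names `BASHKIR_ALPHABET`, `RANK`, and `bashkir_sort_key` are hypothetical, and this should not be read as Brian's implementation.

```python
# The 42 letters of the Bashkir Cyrillic alphabet, in alphabetical order.
BASHKIR_ALPHABET = "АБВГҒДҘЕЁЖЗИЙКҠЛМНҢОӨПРСҪТУҮФХҺЦЧШЩЪЫЬЭӘЮЯ"

# Hypothetical helper table: letter -> position in the alphabet.
RANK = {letter: index for index, letter in enumerate(BASHKIR_ALPHABET)}


def bashkir_sort_key(text):
    """Return a key that orders text by the Bashkir alphabet, ignoring case."""
    key = []
    for ch in text.upper():
        if ch in RANK:
            key.append((0, RANK[ch]))  # known letter: use its alphabet position
        else:
            key.append((1, ord(ch)))   # anything else: after letters, by code point
    return key


# Illustrative entries; only "Өфө" (Ufa) comes from the article.
entries = ["Якутск", "Өфө", "Абакан"]
print(sorted(entries, key=bashkir_sort_key))
```

With this key, Өфө sorts between Абакан and Якутск, since Ө sits in the middle of the alphabet. Full collation libraries such as ICU also handle digits, punctuation, and multi-level comparison, so this shows only the core idea.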
When the fix was ready, I got it deployed and tested on the Bashkir Wikipedia. And when this started working, the Bashkir Wikipedians were so happy about it that the biggest Bashkir newspaper, simply called Bashkortostan, got interested, and published a story about it.
And yes, it mentions Brian Wolff. Search for “Брайан Вулфф”. (Bashkir is not supported by Google Translate, but it is supported by Yandex.Translate. Machine translation is never perfect, but if you’re curious, you can try using it to get an idea of what the article says.)
Bashkir is the first language for which complete collation is implemented inside MediaWiki. I am already starting to hear requests to do something like this for other languages, and thanks to Brian’s work it will now be much easier. The fact that Bashkir was the first one shows how an active editing community that cares about its language can make things happen.
We are doing amazing things that affect the world in ways we don’t even imagine!
Amir Aharoni, Wikimedian

Editor’s note: While Amir is an employee of the Wikimedia Foundation, this post is written in a volunteer capacity.

Archive notice: This is an archived post from blog.wikimedia.org, which operated under different editorial and content guidelines than Diff.


4 Comments

Thanks to Amir and Brian. Come to Bashkortostan at the end of summer or early autumn. We look forward to welcoming you with fresh honey. “Fly along the path that I indicated, collect nectar from flowers, this will be a medicine for people”, – said the Almighty, so the Bashkirs care about the bees.

Congratulations, Bashkir and Brian! Great to know that the process is much faster now.

Do I understand correctly that these additional characters are all just single Unicode characters, which only need to be put into a different order, either by ICU or by MediaWiki? For the Thai Wikipedia, sorting within categories is much worse, and sort keys have to be added in most cases, because sorting is by the consonant of the syllable, but the vowel may come before the consonant in the script as well as in the Unicode character sequence. It would be great if there were a technical solution for Thai as well.

@maewnam I started https://phabricator.wikimedia.org/T176434 to track improving Thai collation. At first glance, it looks like we may be able to use libicu for this. If you are up for it, it would be really helpful to have a native speaker test out possibilities and coordinate with the Thai wiki communities.