Wikimedia blog

News from the Wikimedia Foundation and about the Wikimedia movement

Odia language gets a new Unicode font converter

Screenshot mock-up of Akruti Sarala – Unicode Odia converter

It has been over a decade since the Unicode standard was made available for the Odia script. Odia is spoken by roughly 33 million people in eastern India and is one of India's official languages. Since then, it has been difficult to get more Odia content into Unicode, because many people who are used to non-Unicode standards are unwilling to make the move. This created the need for a simple converter that could take text already typed in various non-Unicode fonts and convert it to Unicode. Such a tool could enrich Wikipedia and other Wikimedia projects by converting previously typed content and making it more widely available on the internet. Odia recently got such a converter, which makes it possible to convert text in two of the fonts most popular among media professionals (AkrutiOriSarala99 and AkrutiOriSarala) into Unicode.

All of the non-Latin scripts came under one umbrella after the rollout of Unicode. Since then, many Unicode-compliant fonts have been designed, and the open source community has put effort into producing good-quality fonts. Though contributions to Unicode-compliant portals like Wikipedia increased, the publication and printing industries in India remained stuck with the pre-existing ASCII and ISCII standards (ISCII being an Indian encoding standard based on ASCII). Modified ASCII fonts that were used to typeset newspapers, books, magazines and other printed documents are still in use in these industries. This has created a massive amount of content that is not searchable or reproducible because it is not Unicode compliant. The difference is that a Unicode font carries separate glyphs for the Indic script characters alongside the Latin glyphs, whereas a modified ASCII font replaces the Latin glyphs with Indic characters. So, when someone does not have a particular ASCII-based font installed, the typed text looks like gibberish (see Mojibake). Text typed using one Unicode font, however, can be read using another Unicode font on a different operating system. Most of the ASCII fonts used for typing Indic languages are proprietary, and many individuals and organizations even use pirated software and fonts. Having massive amounts of content in multiple standards and little content in Unicode has created a large gap for many languages, including Odia. Until all of this content is converted to Unicode to make it searchable, sharable and reusable, the knowledge base it represents will remain inaccessible. Fortunately, some of the Indic languages have more and more contributors creating Unicode content. There is a need for technological development to convert non-Unicode content to Unicode and open it up for people to use.
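
To make the distinction concrete, here is a minimal Python sketch of what converting legacy-font text to Unicode involves. The mapping table is entirely hypothetical: it does not reproduce the real AkrutiOriSarala layout, and the names `LEGACY_TO_UNICODE` and `legacy_to_unicode` are invented for illustration. Only the Odia code points themselves come from the Unicode standard.

```python
# Hypothetical mapping from the Latin characters stored by a legacy
# (modified ASCII) font to the Odia code points they are meant to show.
# This is NOT the real AkrutiOriSarala table; it only illustrates the idea.
LEGACY_TO_UNICODE = {
    "K": "\u0b15",  # ODIA LETTER KA
    "T": "\u0b1f",  # ODIA LETTER TTA
    "A": "\u0b3e",  # ODIA VOWEL SIGN AA
}


def legacy_to_unicode(text: str) -> str:
    """Replace each legacy character with its Odia code point.

    Characters without a table entry (spaces, digits, punctuation)
    pass through unchanged.
    """
    return text.translate(str.maketrans(LEGACY_TO_UNICODE))


# Without the legacy font installed, a reader or a search engine sees
# only the stored Latin letters.
stored = "KTK"
# After conversion the text is real Odia (here the three letters KA, TTA,
# KA) and can be shown by any Unicode Odia font.
converted = legacy_to_unicode(stored)
print(converted)
```

The stored characters never change meaning when the font changes, which is why the converted text stays readable across fonts and operating systems while the legacy text does not.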

Akruti Sarala – Unicode Odia converter user manual

Media and publication houses use a few different kinds of fonts, the most popular being Akruti. Two other popular standards are LeapOffice and Shreelipi. Akruti software comes bundled with a variety of typefaces and an encoding engine that works well in Adobe Acrobat Creator, the most popular DTP software package. Industry professionals are comfortable using it because of its reputation and seamless printing. The problem of migrating content from other standards to Unicode arose when the Odia Wikimedia community started reaching out to these industry professionals. Authors, government employees and other professionals turned out to be more comfortable using one of the standards mentioned above. All of these people type using either a generic popular standard, Modular, or a universal standard, Inscript. Fortunately, the former is now incorporated into MediaWiki's Universal Language Selector (ULS) and the latter is in the process of being added to ULS. Once this is done, many people will be able to start contributing to Wikipedia easily.

Content that has been typed in various modified ASCII fonts includes encyclopedias that could help grow Wikisource and Wikiquote. All of it needs to be converted to Unicode. The non-profit group Srujanika first initiated a project to build a converter for two different Akruti fonts, AkrutiOriSarala99 and OR-TT Sarala, the former being outdated and the latter less popular. The “Rebati 1 converter” built by the Srujanika team was not being maintained and had become more of an orphan project. Fellow Wikimedian Manoj Sahukar and I used parts of the “Rebati 1 converter” code to build another converter. The new “Akruti Sarala – Unicode Odia converter” can convert the more popular AkrutiOriSarala font and its predecessor AkrutiOriSarala99, which is still used by some. Odia Wikimedian Mrutyunjaya Kar and journalist Subhransu Panda have helped by reporting broken conjuncts, which helps in fixing problems before publishing. Odia authors and journalists have already started using the font and many of them post regularly in Odia. We are waiting for more authors to contribute to Wikipedia by converting their work and wikifying it.
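
Conjuncts are a likely place for a converter of this kind to break, because legacy fonts typically give a whole conjunct its own glyph while Unicode spells it as consonant, virama (U+0B4D), consonant. The sketch below uses an invented table, not the real Akruti one, and hypothetical names (`TABLE`, `convert`). It matches the longest legacy sequence first and collects any letters it cannot map so that they can be reported.

```python
from __future__ import annotations

VIRAMA = "\u0b4d"  # ODIA SIGN VIRAMA, joins consonants into a conjunct

# Hypothetical table of legacy sequences -> Unicode (not the real layout).
TABLE = {
    "K": "\u0b15",                      # KA
    "S": "\u0b37",                      # SSA
    "x": "\u0b15" + VIRAMA + "\u0b37",  # one legacy glyph for the conjunct KA+SSA
    "Kh": "\u0b16",                     # two legacy characters for KHA
}


def convert(text: str) -> tuple[str, list[str]]:
    """Convert legacy text; return the result and any unmapped letters."""
    keys = sorted(TABLE, key=len, reverse=True)  # try the longest match first
    out, unmapped, i = [], [], 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                out.append(TABLE[key])
                i += len(key)
                break
        else:
            # No table entry: keep the character, and flag it if it is a
            # letter, since that probably means a missing glyph mapping.
            if text[i].isalpha():
                unmapped.append(text[i])
            out.append(text[i])
            i += 1
    return "".join(out), unmapped


result, unmapped = convert("KhxK q")
print(result, unmapped)
```

Matching the longest sequence first keeps `Kh` from being read as `K` followed by a stray `h`, and the list of unmapped letters is the kind of information a tester could pass back when reporting a broken conjunct.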

A beta version of another Unicode font converter, for Shreelipi fonts, was recently released, based on Odia Wikipedian Shitikantha Dash's initial code. It works with at least 85% accuracy.

Even after getting classical status, the Odia language is not being used as actively on the internet as some other Indian languages. The main reason is that our writing system has not been web-friendly. Most people in Odisha with typing skills use the Modular keyboard and Akruti fonts. Akruti, as we know, is not web-compatible. There are thousands of articles, literary works and news stories typed in Akruti fonts lying unused (on the internet). Thanks to Subhashish Panigrahi and his associates, who have developed this new font converter, you can now convert your Akruti text into Unicode. I have checked it. It’s error-free. Now it’s easy for us to write articles online (for Wikipedia and other sites).

Yes, we are late entrants as far as the use of vernacular languages on the internet is concerned. But this converter will help us to go godspeed. Let’s make Odia our language of communication and expression.

Subhransu Panda, Journalist, author and publisher

Subhashish Panigrahi, Odia Wikipedian and Programme Officer, Centre for Internet and Society

2 Responses to “Odia language gets a new Unicode font converter”

  1. Hi Nemo, thanks for pointing this out. Before I start replying, could I also suggest that you copy some sample text from an Odia newspaper, for example from http://www.thesamaja.com/news_view.php?news_id=65143, and paste it into a text editor? You would see random text. That is because of the modified ASCII font they have used. If you convert it using the font converter, you will see Odia characters. It has nothing to do with OCR. It is encoding conversion. With readers in mind, the technical component has been hidden behind the usability. Hope that helps.

  2. Nemo says:

    Great to work on Odia, but after reading this I still have no idea what the feature does: “convert text once typed in various non-Unicode fonts to Unicode”, what does that mean? OCR of the fonts? Or did you mean “non-Unicode encodings”, and this is able to do encoding detection? Or did you mean that something using a representation of Odia with a small character range (ASCII?) gets mapped to Odia characters?
    The screenshot doesn’t help: the seemingly random letters there might mean it’s text with messed up encoding, or it could be a complex character map.