How the Odia Wikimedia community is enriching Wikipedia with character encoding technology

Accessing sources in the Odia language is becoming easier, thanks to new character encoding converters. Photo by Subhashish Panigrahi, freely licensed under CC-BY-SA 4.0.

Last year, I wrote about the Odia-language character encoding converters that the Wikipedia community was developing. These convert text from older encodings into Unicode, a universal encoding standard. The converters have now been made available for use, just in time for today’s thirteenth anniversary of the Odia Wikipedia.

A character encoding assigns a numeric code to each character in a collection, and is used in computation, data storage, and transmission of textual data. Before the advent of Unicode, fonts for different scripts used several different, mutually incompatible encoding systems.
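
As a small illustration of what a character encoding looks like in practice, the Python sketch below lists the Unicode code point and UTF-8 bytes of each character in the word "Odia" written in Odia script. The variable names are illustrative, and only the Python standard library is assumed.

```python
import unicodedata

# The word "Odia" in Odia script, spelled out as code points so the example
# does not depend on how this page itself is encoded.
word = "\u0b13\u0b21\u0b3c\u0b3f\u0b06"

for ch in word:
    # Unicode character names still use the older spelling "ORIYA".
    code_point = f"U+{ord(ch):04X}"
    utf8_hex = ch.encode("utf-8").hex(" ")
    print(code_point, unicodedata.name(ch), utf8_hex)

utf8_bytes = word.encode("utf-8")
# Every character in the Odia block (U+0B00 to U+0B7F) takes three bytes in UTF-8.
print(len(word), "code points,", len(utf8_bytes), "bytes")
```

Any Unicode-aware program reads these same code points as the same Odia letters, whichever font is installed.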

However, most media outlets—as well as the state government—are still using old encoding systems for Odia. These require the installation of a particular font using the same encoding system to read documents. Unicode makes this much easier, as most modern computers come with Unicode fonts preinstalled.

A character encoding converter converts text from one encoding system to another. Massive amounts of content that are not archived on a regular basis can now be converted to Unicode and, in turn, give Wikipedia editors easily accessible sources for creating new articles and enhancing existing ones.
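
The following Python sketch shows one common way a table-driven converter of this kind can work. It is not the code of the converters described in this post. The mapping table is hypothetical, since every legacy Odia font defines its own, and the function name `convert` is an assumption made for illustration. The sketch also assumes a legacy font that stores the vowel sign E before its consonant, in visual order, whereas Unicode stores it after.

```python
import re
import unicodedata

# Hypothetical legacy-glyph table, for illustration only.
# Real legacy Odia fonts each use their own, much larger, table.
LEGACY_TO_UNICODE = {
    "k": "\u0b15",               # ORIYA LETTER KA
    "m": "\u0b2e",               # ORIYA LETTER MA
    "a": "\u0b3e",               # ORIYA VOWEL SIGN AA
    "E": "\u0b47",               # ORIYA VOWEL SIGN E (stored before the consonant here)
    "kS": "\u0b15\u0b4d\u0b37",  # conjunct KA + VIRAMA + SSA stored as two legacy codes
}

# Try longer legacy sequences first so "kS" is not read as "k" followed by "S".
_KEYS = sorted(LEGACY_TO_UNICODE, key=len, reverse=True)
_GLYPHS = re.compile("|".join(re.escape(k) for k in _KEYS))

# Vowel sign E followed by a consonant cluster (consonants joined by VIRAMA).
_PREBASE_E = re.compile("(\u0b47)((?:[\u0b15-\u0b39]\u0b4d)*[\u0b15-\u0b39])")


def convert(text: str) -> str:
    """Convert text in the hypothetical legacy encoding to Unicode."""
    # Step 1: replace each legacy code sequence with its Unicode equivalent.
    mapped = _GLYPHS.sub(lambda m: LEGACY_TO_UNICODE[m.group(0)], text)
    # Step 2: move the vowel sign E from visual order to after its consonant cluster.
    reordered = _PREBASE_E.sub(r"\2\1", mapped)
    # Step 3: NFC composes E + AA into the single vowel sign O (U+0B4B).
    return unicodedata.normalize("NFC", reordered)


print(convert("Eka"))  # KA followed by vowel sign O
```

Characters missing from the table, such as spaces and digits, pass through unchanged. A production converter would need a complete table for each source font and more reordering rules than this single one.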

The Odia language is spoken by over 40 million people in eastern India, across various Indian cities, and by expatriates abroad. It is one of the oldest languages in South Asia, and is recognized as a “classical language” by the Indian government. The Odia Wikipedia celebrates its thirteenth anniversary today, June 3.

Wikimedian and developer Jnanaranjan Sahu receives an award at the Odisha Dibasa 2014 celebration. Photo by Biswadeep Mishra, CC BY-SA 3.0.

However, the “classical language” status has not yet boosted knowledge production or use of the language on the Internet. Almost all online newspapers and state publications, such as Odia-language journals, public announcements, and information portals, host their content in various legacy character encodings that do not allow users to easily access and share information. This has, unsurprisingly, proven to be a major hurdle for the small Odia Wikimedia community, who hope to enrich their project with Odia-language citations.

To help solve these problems, the community tried using two encoding converters. These had previously been developed by friends from Srujanika, a non-profit based in Bhubaneswar, Odisha, that works on promoting science education in school curricula in the Odia language, as well as the digitization of early Odia literature. These converters became the building blocks on which Wikimedian Manoj Sahukar built new converters after substantially rewriting their code. I was also part of the rebuilding process, from the initial development of the converters to the design of their interface, and I helped to design handouts teaching new users how to use them.

The community played a major role in promoting the converters on social media. An op-ed in the Odia newspaper Samaja helped to reach more people who were unaware of the uses of Unicode. Many Internet users did not realize that they had been sharing knowledge on their blogs or social media using various legacy encodings. Text in these encodings neither appears in search engine results nor can be shared in an accessible way.

The converters were improved over time by using news and articles from newspapers and magazines as test cases. Citing Odia-language sources wasn’t so easy before: making use of any content from a local newspaper could take hours.

From September 2014 to March 2015, a small project ran to convert text from several newspapers and magazines, so that it could be used for citations in Odia Wikipedia articles. This is important because, when these sources are not available in Unicode, search engines and Wikipedia users can have difficulty finding them.

A conversation with developer and Wikimedian Manoj Sahukar about designing an encoding converter for Odia. Audio recording by Subhashish Panigrahi, freely licensed under CC BY-SA 3.0.

Because the converters were hosted separately on Google Drive, it was difficult to have them all in one central place. Odia Wikipedians wanted their Wikipedia to host a single converter, where a user could select the appropriate input encoding. Wikimedian and developer Jnanaranjan Sahu came up with a responsive, wiki-based converter that went live on May 12 and is now available for use. The converter lets users choose the source encoding from a drop-down menu, and converts the input into Unicode. Issues with the conversion process can be reported via a Google spreadsheet.
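
The idea of one converter with a selectable source encoding can be sketched in Python as a registry that maps each encoding name to its own table. This is only an illustration of the design, not the wiki-based converter's actual code. The encoding names `legacy-font-a` and `legacy-font-b`, their two-entry tables, and the function `to_unicode` are all hypothetical.

```python
# Hypothetical registry: one translation table per legacy source encoding.
# The names and mappings are placeholders, not real Odia font encodings.
CONVERTERS = {
    "legacy-font-a": str.maketrans({"k": "\u0b15", "m": "\u0b2e"}),
    "legacy-font-b": str.maketrans({"q": "\u0b15", "w": "\u0b2e"}),
}


def to_unicode(text: str, source_encoding: str) -> str:
    """Convert text using the table chosen by the user, as a drop-down menu would."""
    try:
        table = CONVERTERS[source_encoding]
    except KeyError:
        raise ValueError(f"unknown source encoding: {source_encoding!r}") from None
    return text.translate(table)


# The same Odia letters (KA, MA) come out whichever legacy encoding they started in.
print(to_unicode("km", "legacy-font-a") == to_unicode("qw", "legacy-font-b"))
```

Keeping every table behind a single entry point means that supporting another legacy encoding only requires adding one more entry to the registry.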

Combining five different converters into one, Jnanaranjan says, was a necessary next step in development: “When I found that there are different URLs for different converters, and that the URLs lead to a bunch of different sites, it seemed quite messed up. It would have been difficult for users to locate each of the converters. I thought it would be easier for users if they could find all the encoding converters for Odia on one page on their home wiki. So, I tried to tweak the source code and design this converter.”

He also explains that several newspapers whose news is encoded in older systems are now rich information sources. “Converting them and using the information to add more citations to Wikipedia could help to achieve the dream of every single person being able to contribute more information to the Odia Wikipedia,” he says, “so all human knowledge may be available in our language.”

Subhashish Panigrahi
Wikimedian and Programme Officer
Centre for Internet and Society, Bengaluru, India.

Archive notice: This is an archived post from blog.wikimedia.org, which operated under different editorial and content guidelines than Diff.

2 Comments

I gave up on a problem similar to this in 2013. At that time, I was trying to parse an archive of old Telugu and Urdu newspapers. Both of them used a font that used its own custom encoding.
The problem was further exacerbated as the documents were stored as PDFs. Maybe it's time to go back.

[…] these projects has been also creating many tools and technical resources like editor manuals. The script encoding converters that the community has built is helping a lot of online users to be able to share their writings […]