Chinese-language search fixes for MediaWiki

Search is an important part of any web app like a wiki, but search is harder than it looks, especially in a multilingual environment. MediaWiki has to support not just your standard Western languages like English and Spanish, but many more with special requirements:

  • Some can be written in multiple scripts (such as Serbian in Cyrillic or Latin), and searches should match text written either way.
  • Some languages, like Chinese and Japanese, don’t use word spacing. To let the search index know where word boundaries are, we have to internally insert spaces between some characters:

维基百科 -> 维 基 百 科
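A minimal sketch of that spacing step, in Python rather than MediaWiki's own PHP. The function name `segment_cjk` is made up for illustration, and the character range is an assumption: it covers only the basic CJK Unified Ideographs block, while the real code has to deal with more scripts than that.

```python
import re

# Assumption: only the basic CJK Unified Ideographs block (U+4E00..U+9FFF).
# The real segmentation code needs to cover more ranges than this.
CJK_CHAR = re.compile(r'[\u4e00-\u9fff]')

def segment_cjk(text: str) -> str:
    """Surround each CJK character with spaces, then collapse extra whitespace."""
    spaced = CJK_CHAR.sub(lambda m: f' {m.group(0)} ', text)
    return ' '.join(spaced.split())

print(segment_cjk('维基百科'))
```

Latin-script words pass through untouched, so mixed text like `MediaWiki 维基` only gets the Chinese part split up.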

Then to add insult to injury, we need to fudge the Unicode characters into ASCII-only tokens (a “u8” prefix followed by the hex of each character’s UTF-8 bytes) to ensure things work reliably with older and newer versions of MySQL:

维 基 百 科 -> u8e7bbb4 u8e59fba u8e799be u8e7a791
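The encoding in that example can be reproduced with a few lines of Python. This is just a sketch that matches the output shown above, not the actual MediaWiki code. The names `encode_token` and `encode_for_index` are hypothetical, and passing pure-ASCII tokens through unchanged is an assumption.

```python
def encode_token(token: str) -> str:
    """Turn a non-ASCII token into 'u8' plus the hex of its UTF-8 bytes."""
    if token.isascii():
        # Assumption: plain ASCII words are left as they are.
        return token
    return 'u8' + token.encode('utf-8').hex()

def encode_for_index(spaced_text: str) -> str:
    """Encode every whitespace-separated token of already-segmented text."""
    return ' '.join(encode_token(t) for t in spaced_text.split())

print(encode_for_index('维 基 百 科'))
```

For example, 维 is U+7EF4, which is the three bytes E7 BB B4 in UTF-8, hence `u8e7bbb4`.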

For a long time, this word segmentation wasn’t being handled correctly for Chinese in our default MySQL search backend, so searching for a multi-character word often gave false matches where the characters were all present, but not together.
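To make the failure mode concrete, here is a toy Python comparison of the two matching behaviors. It is an illustration only: the two sample sentences and both function names are invented, and real full-text search works on an index rather than on substring checks.

```python
def matches_all_chars(query: str, doc: str) -> bool:
    """Loose matching: every character of the query appears somewhere in doc."""
    return all(ch in doc for ch in query)

def matches_phrase(query: str, doc: str) -> bool:
    """Strict matching: the query's characters appear together, in order."""
    return query in doc

docs = [
    '维基百科是一部百科全书',   # really contains the word 维基百科
    '百科全书的基本思维',       # contains 维, 基, 百, 科, but scattered
]

for doc in docs:
    print(doc, matches_all_chars('维基百科', doc), matches_phrase('维基百科', doc))
```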

This is now fixed for MediaWiki 1.16; the intermediate query representation passed to the search backend now internally treats your multi-character Chinese input as a phrase, which will only match actual adjacent characters:

维基百科 -> +"u8e7bbb4 u8e59fba u8e799be u8e7a791"
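As a rough Python sketch, building that phrase term from a Chinese word could look like this. The function name `phrase_term` is hypothetical, and it assumes the input consists entirely of Chinese characters; the real query parser also has to cope with mixed input, operators, and so on.

```python
def phrase_term(word: str) -> str:
    """Build a required, quoted phrase from the per-character encoded tokens.

    Assumption: every character in `word` is a Chinese character that
    should become its own 'u8' + UTF-8-hex token.
    """
    tokens = ['u8' + ch.encode('utf-8').hex() for ch in word]
    return '+"' + ' '.join(tokens) + '"'

print(phrase_term('维基百科'))
```

The syntax follows MySQL's boolean full-text mode, where a leading `+` marks a term as required and double quotes make the enclosed words match only as a phrase.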

Note that Wikimedia’s sites such as Wikipedia run on a fancier, but more demanding, search backend with a separate Java-based engine built around Apache Lucene. Sometimes we have to remind ourselves that third-party users will mostly be using the MySQL-based default, and oh boy it still needs some lovin’! :)
