Saturday, April 05, 2008

Unicode 5.1

Today I learned that Unicode 5.1 has been released. According to the information I received, one major feature is of particular relevance to Japanese, Chinese and Korean texts: support for ideographic variation sequences. Line breaking has been improved for Polish and Portuguese hyphenation. Users of the Indic languages will be happy with the improved text segmentation algorithm.

There are 1,624 newly encoded characters. These include characters required for Malayalam and Myanmar, but there are also new characters for the Latin script. Newly supported are the Cham, Lepcha, Ol Chiki, Rejang, Saurashtra, Sundanese, and Vai scripts.

For the techies, the collation algorithms have been updated to include all the new characters. This also has an effect on contractions, like the "ch" in the Slovak language, where two letters are sorted as a single unit.
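
To make the idea of a contraction concrete, here is a minimal sketch in Python. It is not the Unicode Collation Algorithm, only a toy sort key that treats "ch" as one collation element placed after "h", as in the Slovak alphabet. The alphabet table, the `sort_key` helper and the sample words are my own assumptions for illustration, and diacritics are ignored entirely.

```python
# Simplified alphabet (an assumption for illustration): the basic Latin
# letters, with the digraph "ch" inserted as its own element after "h".
ALPHABET = list("abcdefgh") + ["ch"] + list("ijklmnopqrstuvwxyz")
RANK = {element: position for position, element in enumerate(ALPHABET)}


def sort_key(word):
    """Return a list of ranks, consuming "ch" as a single element."""
    word = word.lower()
    key = []
    i = 0
    while i < len(word):
        if word.startswith("ch", i):
            # Contraction: two code points map to one collation element.
            key.append(RANK["ch"])
            i += 2
        else:
            # Unknown characters sort after everything in the table.
            key.append(RANK.get(word[i], len(ALPHABET)))
            i += 1
    return key


words = ["chata", "cesta", "hora", "ihla", "dom"]
plain_order = sorted(words)                   # code point order
slovak_order = sorted(words, key=sort_key)    # "ch" sorts after "h"
print(plain_order)
print(slovak_order)
```

With plain code point ordering "chata" lands between "cesta" and "dom"; with the contraction it moves to sit between "hora" and "ihla". A real implementation would rely on a collation library driven by the Unicode data rather than a hand-written table like this one.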

Many of these changes affect languages supported in Wikimedia projects. My question is: when will we have support for this? Does it depend on the MediaWiki / PHP code, and does it also depend on the browser?


Minh Nguyễn said...

Collation relies on PHP and MySQL, though both likely default to Unicode's algorithms.

brion said...

The only thing that would really affect MediaWiki at this time would be case mappings and normalization (if any changes have been made). These can be automatically regenerated from the updated Unicode data files.
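
As a rough illustration of what regenerating case mappings from the Unicode data files can look like, here is a small Python sketch. It is not MediaWiki's actual generator (which is PHP); the `parse_case_mappings` helper is hypothetical. It assumes lines in the semicolon-separated `UnicodeData.txt` format, where field 12 holds the simple uppercase mapping and field 13 the simple lowercase mapping. The last lines show normalization using Python's built-in `unicodedata` module, whose tables come from whatever Unicode version the interpreter ships with.

```python
import unicodedata


def parse_case_mappings(lines):
    """Build {code point: (upper, lower)} from UnicodeData.txt-style lines.

    Hypothetical helper: only the simple one-to-one mappings are read.
    """
    mappings = {}
    for line in lines:
        fields = line.strip().split(";")
        code = int(fields[0], 16)
        upper = int(fields[12], 16) if fields[12] else None
        lower = int(fields[13], 16) if fields[13] else None
        if upper is not None or lower is not None:
            mappings[code] = (upper, lower)
    return mappings


# Two sample lines in the UnicodeData.txt format.
SAMPLE = [
    "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;",
    "0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041",
]

case_map = parse_case_mappings(SAMPLE)
print(case_map)

# Normalization: "e" followed by a combining acute accent composes
# to the single precomposed character under NFC.
composed = unicodedata.normalize("NFC", "e\u0301")
print(composed == "\u00e9")
print(unicodedata.unidata_version)  # Unicode version of this Python build
```

Feeding the full data file from a newer Unicode release through a script of this kind yields updated tables without hand editing.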