Tuesday, December 19, 2006

The use of standards to emancipate languages

More and more languages are supported by the localisation efforts of projects like Open Office. This has a major effect on the emancipation of these languages: much content is written, and some of it ends up on the Internet. Once it is there, the information is often lost, because a user who is not sophisticated in the use of search engines cannot find it. Sophistication is needed because Google currently recognises only 100 languages, only 15% of the content of the World Wide Web is tagged to indicate its language, and much of that is tagged incorrectly. When the quality of the tagging improves, search engines will be able to return only information in the requested language. An added benefit is that a growing corpus of content will become available, which in turn will stimulate research in these languages.
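To make the tagging point concrete, here is a minimal sketch of how a tool could read the language a web page declares and keep only the pages in a requested language. It assumes the page declares its language in the lang attribute of its html element; the names LangTagParser, declared_language and pages_in_language are invented for this illustration and are not taken from any search engine or from OmegaWiki.

```python
from html.parser import HTMLParser


class LangTagParser(HTMLParser):
    """Records the lang attribute of the first <html> element, if any."""

    def __init__(self):
        super().__init__()
        self.lang = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, but not values.
        if tag == "html" and self.lang is None:
            for name, value in attrs:
                if name == "lang" and value:
                    self.lang = value.strip().lower()


def declared_language(page):
    """Return the declared language tag of a page, or None if untagged."""
    parser = LangTagParser()
    parser.feed(page)
    return parser.lang


def pages_in_language(pages, wanted):
    """Keep pages whose primary language subtag equals `wanted`."""
    wanted = wanted.lower()
    result = []
    for page in pages:
        lang = declared_language(page)
        # "nl-be" counts as "nl"; untagged pages cannot be matched at all.
        if lang is not None and lang.split("-")[0] == wanted:
            result.append(page)
    return result
```

An untagged page simply drops out of every language-specific result, and a wrongly tagged page ends up in the wrong one, which is why the quality of the tagging matters so much for the smaller languages.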

OmegaWiki is a wiki-based website that aims to provide information of a lexical, terminological and ontological nature. It does this by extending the MediaWiki software with relational functionality. OmegaWiki aims to have all words in all languages.

In order to learn what languages exist, OmegaWiki adopted the ISO 639-3 standard. This standard leaves out many linguistic entities, such as dialects and orthographies. OmegaWiki has had the good fortune to come into contact with the WLDC and GeoLang, the organisations that deal with the ISO 639-6 standard, which is under development. Together with the WLDC, OmegaWiki has proposed to the ISO task group that the environment and functionality OmegaWiki provides be used to gather the data needed to learn about the different linguistic entities. The task group accepted this proposal at the LSGB conference in Vienna.

Much of the groundwork has now been done; the next step is to make this functional. To do so, we want software to adopt the existing standards and the information provided in OmegaWiki. This is feasible for Open/Free software. We have already approached the lead developer of OmegaT, and he will be happy to support this because it will make OmegaT more relevant. Because of what OmegaWiki aims to do, we will be able to build spell-checkers for linguistic entities on a regular basis. This in turn will have an impact on the standardisation and the emancipation of languages.

With the emancipation of languages, bringing information to people in their native languages will increasingly become an option. Studies have shown that this leads to a much better understanding and appreciation of the services provided. Particularly in what is called the long tail of the language industry, there has been little support for any of the standards. Consequently, the quality of service provided when translating for languages in the long tail is inconsistent. By developing tools that support any linguistic entity, it will become possible to use these languages for written communications. It will also become possible to raise the quality of these communications by applying the work that has been done in the language industry.

This project will take off best when many people and organisations collaborate. All of them will have their own reasons to want to be included. The challenge will be to coordinate things in such a way that all the necessary parts are realised.

There are sufficient reasons for many organisations to buy into this project. What is needed is not only to leverage this, but also to explore how the data that is gathered and the functionality that is built can be extended to provide additional information that is relevant to the project. Many organisations will have a need for information; we will want their active participation, their support, or both. We will run into all types of technical complications, and we need the buy-in of the people who can resolve these issues.

All in all, it will take a lot of effort to make the difference that we aim for.
