Wednesday, February 28, 2007

What is a Word?

First, let me make it perfectly clear that this is a discussion that has raged for centuries. I know full well that everybody has his or her own opinion on this matter, and that I am not going to resolve this issue today. This is an overview and a bit of personal opinion, as it relates to online dictionaries.

Intuitively, we all know the answer. A word is a unit of language conveying some meaning. But how do we decide what is a real word? We look in a dictionary, of course. What do we do if we're writing a dictionary?

We are caught between cataloging what is "right" (prescriptivism) and what is actually done (descriptivism). The pendulum has lately swung towards descriptivism, and I would say that there are some good reasons for that trend. The language that is spoken on the streets is not the same language that is written in academia. Somebody learning a language may genuinely need help sorting out the less proper terms in it.

Take, for instance, colloquialisms such as "irregardless" and "humongous", and all the phrases that have gotten squished together into amalgams like "gotcha" and "woulda". Most people would readily agree that these words do not belong in a college thesis.

What is the scrupulous lexicographer to do? Fortunately, it is not a strict either-or question, especially in a work not substantially limited by size. In an electronic resource, we can put them in anyway. To satisfy the formal sorts, the prescriptivists, we can then place a prominent usage note in the entry, explaining just why a writer might wish to use caution with the term: "ginormous" is a colloquial term, regarded by many to be something less than a proper word. Thus, the reader is both informed and cautioned.

That's fine for most of the slang and jargon, but we have another problem. People keep making up new words. My sister-in-law coined the term "muskaroon" to mean, generically, any small, furry creature that scurries past too quickly to identify. Squirrels, chipmunks, gophers, and presumably rabbits would all qualify. So we have a unit of language with a symbol and a meaning. The trouble is, if you walked up to people on the street and inquired whether there were muskaroons in the area, nobody who hadn't talked lately to my sister-in-law would be able to answer, and that is a small minority of people, indeed.

The test here is usage. Can we demonstrate that the word is in common use? Now, depending on the character of the dictionary, we can define the rules various ways. Was it used by some minimum number of independent sources? Did anybody important (such as Shakespeare or a prominent academic journal) publish the word?

Generally, we also try to find and present examples of the term in what is called "running text". That means that it appears in a paragraph, and isn't only used as somebody's nickname, say. The edge is still a fuzzy one. Are the citations in traditional print sources, such as books and journals, or are they sprinkled in a couple of blogs and forums? Was the word used in only one limited context, or in a variety of sources and over a period of years? These sorts of tests can help to weed out many of the more questionable entries. At some point, though, it may yet come down to a judgment call, if not on whether a word is real, then on how to apply the rules. In these cases, I advise the users of a dictionary to bring a healthy dose of skepticism with them, to recall that even dictionaries are not infallible, and to trust at the very least that these decisions are made by real people who care about the project.

If, knowing all that, you find you don't like the way "they" are running the place, you are invited to do a better job.

Monday, February 19, 2007

Anyone may edit

Written in response to this project.

Somebody asked me today about what happens to a dictionary when anybody can edit it. As anybody who has ever edited a wiki knows, the openness is a mixed blessing.

It is a great thing, because many hands make light work. Dictionaries need to be every bit as large as the languages they catalog, so the process of gathering and maintaining the data is a huge one. As we start to add translations between languages, rather than simply defining a term, that task becomes orders of magnitude bigger. To capture all words in all languages is something that will take nothing less than a wiki and a worldwide community. It is a monumental task, but in a wiki, we can conceive of creating a resource on such a scale.

It is a great thing because so many regions and cultures can be represented. An American may understand most of the English spoken in South Africa or New Zealand, but both of those regions have slang all their own. Chile speaks Spanish far differently than Spain does. All those variants can have their place.

It is a great thing because a wiki can evolve with a language. New terms come into use all the time, and a freely editable electronic resource is not limited in its capacity to store data or to accommodate a large, diverse set of editors.

The big trouble is this: if just anybody can edit, how on earth do we know it is right? I'd like to explore a few approaches here. Of course, any of these approaches could be considered a barrier to entry, but these things are always trade-offs.
  1. Appoint trusted users to do the housekeeping. These are the sysops, administrators, bureaucrats, librarians, or janitors, depending on your point of view. These somebodies keep watch and undo the damage that some of the just-anybodies can do. If somebody writes an article containing typical vandalism, such as "asdfasdf" or "Dave is a dork!", an administrator can delete or undo it. Much vandalism is so predictable that even a bot can detect and remove it. Unfortunately, a select group of administrators, however well trusted or well-read, cannot be everywhere at once, and they cannot know everything. Things get missed, even with a checklist system such as patrolled edits, and misinformation, intentional or otherwise, is not so easy to spot as out-and-out nonsense.
  2. Hold people accountable. Articles have histories, so you can see who did what. Even pseudonymous users develop reputations. Anonymous users tend to attract the most scrutiny. An active, healthy wiki often develops into a meritocracy, with leaders having sway (though not necessarily authority) based on reputation, seniority, and trust in the community. This effect generally works to improve content, but even a well-known, trusted user may make mistakes. If he or she is trusted well enough, there is a risk that an error or oversight may go unnoticed.
  3. Allow anybody and everybody to scrutinize and correct or flag the content. The process is not foolproof, especially in larger projects, but wikis have a remarkable capacity for self-cleaning. Of course, this approach can tend to result in a sort of groupthink effect: if enough people believe it, then it must be so.
  4. Demand credentials. Don't just let in any old riffraff. Wikipedia has clearly shown the power of amateurs and volunteers to create great content, but it is certainly possible to limit the users in a project, or part of the project, to a certain group. This approach is most appropriate to a wiki serving a closed community, such as a professional or academic group, especially one dedicated to a particularly narrow or specialized topic.
  5. Make the messes behind the scenes, and publish only the good stuff, with some review process. The German Wikipedia published a paper book containing selected articles. Online, there have been proposals for a "Stable Versions" system, where a mature article would be reviewed and locked, and any additional changes would go through a separate editing or discussion page.
  6. Demand references. There is a movement within Wikipedia to reference the articles and the claims made in them. In the context of a dictionary, references may be other dictionaries. Is the word recognized by the RAE or the OED (which we trust to have done the requisite homework)? They may be other works about words. Or, they may be citations: quotations that include the word in question. Citations show context and provide evidence that the word is or was in use. Of course, we must still question the validity of the evidence. Are the 400 Google hits there because somebody prolific uses that nonsense word as a handle? Is a word more valid if it was used by a blogger or two, or by Thornton Wilder? Is an etymology known with reasonable certainty, or is it apocryphal? Depending on the size and resources of the wiki, efforts to verify and reference articles may be systematic, or they may be requested when a given entry or fact is questioned.
A wiki is simply a website that anybody can edit. With a bit of care and attention, its content can be as valid and accurate as any other reference, and certainly more complete and up to date.

Friday, February 16, 2007

An article in Nature ...

I am absolutely thrilled with the article that was published last Wednesday in Nature. It is a great article, and it explains really well what we hope to achieve with relational data in MediaWiki. The only thing that is a bit sad is that you have to pay $30 for the privilege of reading it.

The article is great, and what makes it special is the presentation that Knewco has created to explain what we hope to achieve. The demo presents some really impressive figures; it shows the work that was done to integrate several important resources of the biomedical domain, and the numbers involved.

For me, the most important point is that this is likely to be a very important stimulus to the Open Access movement. It shows that it is possible to bring together what was divided. It allows people to work with the terminology of their field and also to add very specific data, information that goes much further than what was envisioned in what was once called the "Ultimate Wiktionary".

The whole notion of a resource that, because of its roots, already merges lexicology, terminology, and ontology is really special. With the integration of such specialised data from different domains, such as the biomedical one, another really interesting experiment will be under way when the data gets imported and merged. There is a nascent community for the biomedical domain, and it will find that it co-exists with the existing OmegaWiki community.

Both communities have everything to gain from collaboration; much of what the existing OmegaWiki community cares about will, to the newcomers, be a fringe benefit. On the other hand, the translations that exist for concepts like malaria will prove valuable when scientific articles that were not published in English are considered.

I am convinced that a bright future is ahead of us. We have this vision of what may come; I wish I could look into the future and see what it will be like. :)


Sunday, February 11, 2007

Why compete when you can collaborate?

All words of all languages of the world... that is what we eventually aim to include in OmegaWiki. This aim is of such a magnitude that you have to be certifiable to come up with such a project. The functional design for the project includes much more; everything including the kitchen sink...

When everything is to be included in one project, it is easy to suggest that people contribute to it. When the project includes everything, why have another?

In an Open Source / Open Content environment, this is not necessarily how it works. Why should the others be seen as competitors? They do their own thing, sure. You may want to achieve the same thing, also true. It is, however, entirely possible to find synergy between projects. This way you can build on each other's accomplishments.

The Shtooka project is something I learned about the other day. The one thing it does really well is make recording pronunciations easy: you can record a string of words, and it will save them for you one at a time.

Wiktionarians saw this, and they are working on an upload facility so that the recordings will also be saved automatically to Commons. I warned that the files should not only be saved as .ogg files; in order to make sure they remain useful to scientists, there should also be a .wav file. The current thinking is that the FLAC file format will work as well, with the benefit that it provides lossless compression. To make sure that this is the case, the Praat software, which is also available under a GPL license, was analysed, and it was considered easy to incorporate the FLAC file format.
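The lossless property the communities care about here is easy to demonstrate with Python's standard-library wave module: PCM audio written to a WAV container reads back bit-for-bit identical, which is exactly what FLAC preserves under compression and what a lossy codec like Ogg Vorbis gives up. This is only an illustrative sketch; the 440 Hz tone and the filename tone.wav are made up for the example, not part of any project's workflow.

```python
import math
import struct
import wave

# Generate one second of a 440 Hz sine tone as 16-bit mono PCM samples.
rate = 8000
samples = [int(20000 * math.sin(2 * math.pi * 440 * t / rate))
           for t in range(rate)]
pcm = struct.pack("<%dh" % len(samples), *samples)

# Write the samples into a WAV container (uncompressed PCM).
with wave.open("tone.wav", "wb") as w:
    w.setnchannels(1)   # mono
    w.setsampwidth(2)   # 16-bit samples
    w.setframerate(rate)
    w.writeframes(pcm)

# Read them back: a lossless container returns the exact same bytes,
# so any analysis a scientist runs sees the original signal untouched.
with wave.open("tone.wav", "rb") as r:
    assert r.readframes(r.getnframes()) == pcm
print("round-trip is bit-identical")
```

A lossy .ogg file would fail the byte-for-byte comparison by design, which is why keeping a .wav or FLAC master alongside the web copy matters.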

People from effectively five different communities are now working together. It will even be possible to include links to OmegaWiki in the Shtooka metadata. This is possible even though both projects do their own thing; both the data and the functionality can be shared.

I may be certifiable, but this kind of collaboration is awesome, and it is why there may be a method to this madness... :)


Thursday, February 08, 2007

Become an OmegaWiki developer

OmegaWiki is now running the latest version of the MediaWiki software used by Wikipedia. This is a major milestone, as it also makes it a lot easier for anyone to join in the fun of developing the open source OmegaWiki/Wikidata software. To give credit where credit is due, these are the people who have contributed to the code so far:
  • Peter-Jan Roes
  • Karsten Uil
  • Sean Burke
  • Rod A. Smith (sticky tree expansion via cookies)
  • Ævar Arnfjörð Bjarmason (namespace code installer)
  • Charles Pritchard (Multilingual MediaWiki development, ongoing)
  • Jelte Zeilstra (untranslated meaning script, under review)
  • Zdenek Broz (statistical scripts, under review)
  • Paa-Kwesi Imbeah (Wikimedia Commons support, under review)
  • Marc Carmen (TBX export, incomplete)
  • myself
There are probably others I forgot. These people, some of them volunteers, some paid developers, are helping to build the first truly multilingual, massively collaborative ontology. If you want to become a part of this history, there are now instructions that should help you get on your way. Please contact me at erik AT openprogress DOT org once you have read and followed these instructions. There are always plenty of things that need doing. And as the organization that runs OmegaWiki, Stichting Open Progress, develops more and more partnerships around the project, we will look to our team of existing developers to help us implement them.