Document Clustering, Classification and Data Mining

General Issues with CLIR

In addition to the drawbacks of the individual techniques used to build CLIR systems, which are discussed below, several general issues must be considered first. One is the basic problem of multilingual text access. For many years, computers and Web browsers could present only certain character sets to users. Languages that use non-Western characters, or that include accents or other diacritical marks such as umlauts, could not be displayed accurately. This concerns CLIR because a system must be able to process and present the characters of every language with which it proposes to work. For an in-depth discussion of the progress being made toward multilingual text access, including character sets, user interfaces, HTML, XML, URL/URI, and HTTP, see Haddouti (1999).
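The character-set problem can be illustrated with a minimal Python sketch. The helper names `roundtrip` and `can_represent` are hypothetical and chosen only for this illustration; the charsets are standard Python codec names. The sketch shows what happens when text is written in one character set and read in another, and when a character set simply lacks the characters a language needs.

```python
def roundtrip(text: str, write_as: str, read_as: str) -> str:
    """Encode text with one charset, then decode the bytes with another."""
    return text.encode(write_as).decode(read_as, errors="replace")


def can_represent(text: str, charset: str) -> bool:
    """Report whether every character of text exists in the given charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


word = "Müller"
print(roundtrip(word, "utf-8", "utf-8"))    # matching charsets: text survives
print(roundtrip(word, "utf-8", "latin-1"))  # mismatch: the umlaut is garbled
print(roundtrip(word, "latin-1", "utf-8"))  # mismatch: a replacement character appears
print(can_represent(word, "ascii"))         # ASCII has no umlauts
print(can_represent("中文", "latin-1"))      # Latin-1 has no Chinese characters
```

A CLIR system that indexes documents under one encoding and displays them under another would show users this kind of damage, which is why consistent character handling comes before any retrieval technique.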

Another problem facing CLIR is that languages vary widely in their structure. In monolingual English IR systems, stemming, "the process of conflating word variants, usually by removing letters" (Croft, Broglio and Fujii, 1996, p. 101), is often used to increase recall. However, stemming cannot be easily generalized to all languages. Spanish, for example, has many more forms of each verb than English, and many other languages have complex structures of their own that make it hard to decompose words for stemming. There is also difficulty with normalization, or breaking down compound words, in some languages such as German (Sheridan and Ballerini, 1996), and some languages, such as Chinese, have no clear breaks between words. Clearly, a CLIR system must be adapted to the characteristics of whichever languages it will use.
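To make the stemming problem concrete, here is a deliberately naive suffix-stripping sketch in Python. It is not the stemmer used in any of the cited studies; the suffix list and the function name `naive_stem` are assumptions made for illustration. It conflates English variants of "walk" into one stem, but the same rules leave forms of the Spanish verb "hablar" scattered across several stems.

```python
# Hypothetical, English-oriented suffix list (illustration only).
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(word: str) -> str:
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


english = ["walk", "walks", "walked", "walking"]
spanish = ["hablo", "hablas", "hablamos", "hablaron"]

# English variants collapse to a single stem.
print({naive_stem(w) for w in english})
# Spanish verb forms do not: English rules miss Spanish endings.
print({naive_stem(w) for w in spanish})
```

A query for one Spanish form would therefore miss documents containing the others, which is the loss of recall that language-specific stemming is meant to prevent.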

Various solutions have been proposed for both the stemming and the decomposition of terms. One example is Croft, Broglio and Fujii's (1996) approach to stemming, which used measures of statistical dependence based on the co-occurrence of terms: Spanish word variants were put into the same class if their measure of dependence exceeded a threshold value. Another example is Sheridan and Ballerini's (1996) use of a dictionary combined with a mechanism that strings word meanings together in order to arrive at the exact meaning. Because the study involved German, which includes words made up of several other words, a dictionary definition was taken for each separate part of the compound, and the system connected these definitions to form a representation equivalent to the original, longer compound word (Sheridan and Ballerini, 1996). Croft, Broglio and Fujii also discuss a word segmentation program created by the Center for Intelligent Information Retrieval at the University of Massachusetts, which can rapidly break Chinese text into separate words. It is important to note that although these solutions have improved performance, none of them has been fully successful, and further adjustments are needed.
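The dictionary-based treatment of compounds can be sketched as follows. This is a minimal illustration of the general idea, not Sheridan and Ballerini's actual mechanism: the toy lexicon, the longest-prefix-first search, and the names `split_compound` and `gloss` are all assumptions.

```python
from typing import List, Optional

# Hypothetical toy lexicon: German component -> English gloss.
LEXICON = {
    "haus": "house",
    "tür": "door",
    "schlüssel": "key",
    "hand": "hand",
    "schuh": "shoe",
}


def split_compound(word: str) -> Optional[List[str]]:
    """Split word into lexicon entries that cover it fully, or return None."""
    word = word.lower()
    if not word:
        return []
    # Try the longest possible first component, then recurse on the rest.
    for end in range(len(word), 0, -1):
        head = word[:end]
        if head in LEXICON:
            rest = split_compound(word[end:])
            if rest is not None:
                return [head] + rest
    return None


def gloss(word: str) -> Optional[str]:
    """Join the glosses of each component into one representation."""
    parts = split_compound(word)
    if parts is None:
        return None
    return " ".join(LEXICON[p] for p in parts)


print(gloss("Haustürschlüssel"))  # components: haus + tür + schlüssel
print(gloss("Handschuh"))         # literal gloss of a word that means "glove"
print(gloss("Katze"))             # not covered by the toy lexicon
```

The second example hints at why such methods are only partly successful: "Handschuh" decomposes cleanly, yet the joined glosses "hand shoe" are not the word an English searcher would use. Real German compounds also often contain linking letters between components, which this sketch does not handle.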

In addition to these issues, choosing the correct word meaning is even more difficult for IR systems that work with multiple languages, because a term that represents one or two concepts in one language may represent several completely different concepts in another. When the system processes a query, there are often several possible meanings for the terms in that query, and unless there is a mechanism for determining which meaning is appropriate, the system may retrieve irrelevant documents. This problem is accentuated in multiple-language searches, where a term may have various meanings in various languages. This ambiguity of terms is one of the greatest problems in CLIR.
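A small Python sketch shows how quickly this ambiguity multiplies when a query is translated term by term. The English-to-Spanish dictionary below is a hypothetical toy example, and `candidate_queries` is an illustrative name rather than part of any system discussed here.

```python
from itertools import product
from typing import List

# Hypothetical toy bilingual dictionary (English -> Spanish senses).
TRANSLATIONS = {
    "bank": ["banco", "orilla"],                     # financial institution, river edge
    "spring": ["primavera", "manantial", "resorte"], # season, water source, coil
    "river": ["río"],
}


def candidate_queries(query: str) -> List[str]:
    """Return every combination of per-term translations for the query."""
    options = [TRANSLATIONS.get(term, [term]) for term in query.lower().split()]
    return [" ".join(combo) for combo in product(*options)]


print(candidate_queries("river bank"))
print(len(candidate_queries("bank spring")))
```

Without some mechanism for choosing among the candidates, a system that searches with all of them will match documents about finance for a query about riverbanks, which is the irrelevant retrieval described above.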
