Document Clustering, Classification and Data Mining

Different aspects of these approaches can be combined in an attempt to create superior CLIR systems. For example, the EMIR (European Multilingual Information Retrieval) project, which ran from 1991 to 1994, combined MT with other IR methods, such as statistical models for weighting query-document intersections, normalization of terms, grammatical tagging, and a reformulation system aimed at disambiguation (Fluhr, 1996). Many approaches also incorporate statistical and term vector translation techniques, mapping sets of TF-IDF term weights between languages (Oard and Dorr, 1996).
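As a rough illustration of term vector translation, the sketch below maps a source-language vector of term weights into the target language through a bilingual dictionary. It is a minimal sketch, not the method of any project cited here: the dictionary entries, the translation probabilities, the query weights, and the function name translate_term_vector are all hypothetical, chosen only to show the idea of spreading each term's weight across its candidate translations.

```python
from collections import defaultdict

# Hypothetical toy bilingual dictionary (French -> English).
# Each source term maps to candidate translations with assumed probabilities.
TOY_DICTIONARY = {
    "banque": {"bank": 1.0},
    "rive": {"bank": 0.5, "shore": 0.5},
}


def translate_term_vector(weights, dictionary):
    """Map a source-language term-weight vector into the target language.

    Each source term's weight is split across its translations in
    proportion to the translation probability. Terms that are missing
    from the dictionary are dropped.
    """
    translated = defaultdict(float)
    for term, weight in weights.items():
        for target, probability in dictionary.get(term, {}).items():
            translated[target] += weight * probability
    return dict(translated)


# Hypothetical TF-IDF-style weights for a French query.
query = {"banque": 2.0, "rive": 1.0, "inconnu": 3.0}
result = translate_term_vector(query, TOY_DICTIONARY)
print(result)  # {'bank': 2.5, 'shore': 0.5}
```

Note that the ambiguous term "rive" contributes weight to two target terms, which is exactly where the disambiguation techniques mentioned above would come into play, and that the out-of-dictionary term "inconnu" is silently lost, a common weakness of dictionary-based mapping.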

Major Projects

Over the past ten years or so, multiple projects have been started to explore CLIR issues, and the subject is gaining popularity at IR conferences. One of these projects, EMIR, led to the commercial product SPIRIT (Syntactic and Probabilistic System for Indexing and Retrieving Textual Information), which allows users to input natural language queries and retrieve documents in English, French, German and Russian (see Haddouti, 1999). Another project, MULINEX (Multilingual Indexing, Navigation and Editing Extensions for the World-Wide Web), allows searches to be filtered by language and subject area and uses automatic translation to help users understand foreign documents (see Haddouti, 1999). CANAL (Catalogue with Multilingual Natural Language Access/Linguistic Server) and TRANSLIB (Tools for Accessing Multilingual Library Catalogs) both support multilingual access to library catalogs (see Oard, 1997). CANAL analyzes queries syntactically and semantically, recognizing compound words and translating key words into other languages, while TRANSLIB uses MT and corpora, such as thesauri and dictionaries, to give access to English, Greek, and Spanish documents. Finally, TwentyOne, a European Union project, is a tool for the dissemination of multimedia information that supports cross-language queries and partial translation of retrieved documents (see Haddouti, 1999). These are only a few of the larger projects that have been undertaken worldwide.

With regard to conferences, CLIR has grown in popularity to the point where a rapidly growing track at the TREC meetings has spun off into its own set of conferences, known as CLEF (Cross-Language Evaluation Forum) (Braschler, Peters, and Schäuble, 2000). CLEF was launched in 2000 to deal specifically with CLIR issues, and continues to be held today.

Directions of Future Research

As the field expands rapidly, research will continue to confront the issues specific to the basic approaches to CLIR. Resolving ambiguity and addressing normalization issues between languages will be important. The refinement and growth of multilingual thesauri, bilingual dictionaries, and corpora for retrieval will also continue. MT is likely to improve, and greater use may be made of relevance feedback, which offers excellent opportunities for increasing the performance of CLIR systems by lessening term ambiguity. As has been shown, different approaches can also be combined, and new combinations or variations of them may further increase system performance.
