Oard and Dorr (1996) give several motivations for research into cross-language information retrieval:
for collections containing documents in many languages, where query formulation for each language would be extremely inefficient.
for documents containing text in more than one language.
for users who cannot form queries in other languages but can make use of retrieved foreign-language documents containing images or names that do not require fluency.
In another article, Oard (1997) points out that cross-language information retrieval would also be very helpful for those who read and write only one language but need information that may not be available in that language.
For all of these reasons, cross-language information retrieval is an important and rapidly growing area of IR, and as such, it merits exploration. To this end, I provide a general overview of cross-language information retrieval, including a definition, the problems involved in creating cross-language systems, the basic IR approaches used, major work and projects undertaken, and possible directions for future research. This work is not meant to be comprehensive, as the field has expanded rapidly in the last decade. Rather, it is an attempt to introduce the major components of cross-language information retrieval and to summarize the work that has been done so far in this area.
Definitions of Cross-Language Information Retrieval
Before delving into the unique problems that multiple languages pose to the world of IR and the basic techniques used to make multiple-language systems functional, it is appropriate to present a clear definition and a word about the terminology used in the body of research about the subject. Much of the literature uses the more common term multilingual information retrieval (MLIR) to represent all that occurs in IR having to do with other languages. Hull and Grefenstette (1996, p.484) give five definitions of MLIR:
IR in any language other than English.
IR on a parallel document collection or on a multilingual document collection where the search space is restricted to the query language.
IR on a monolingual document collection that can be queried in multiple languages.
IR on a multilingual document collection, where queries can retrieve documents in multiple languages.
IR on multilingual documents, i.e. more than one language can be present in the individual documents.
Another definition comes from Oard (1997, p.1), who describes MLIR as the selection of useful documents from collections that may contain several languages. This broad definition seems to encompass the last three definitions given by Hull and Grefenstette, and it is on Oard's definition that this paper is based, since the focus is on IR across languages and not on IR in monolingual settings, whatever the language. In using Oard's definition, I also commit to using the term cross-language information retrieval (CLIR), as he does, because it speaks specifically to IR work across languages, while multilingual information retrieval covers more concepts, including the first two of Hull and Grefenstette's definitions. Therefore, although many of the works and projects in the area of CLIR use the terms multilingual information retrieval or even translingual retrieval, this overview uses only cross-language information retrieval and its abbreviation, in order to keep its focus clear.
©2005 Jatit