Document Clustering, Classification and Data Mining

This method has several limitations. The user must build queries using only vocabulary from the thesaurus, so terms that the thesaurus does not include cannot be searched. Problems also arise when a query contains terms that denote different concepts in different languages. The precision that a controlled vocabulary system can achieve is limited by the number of terms in the thesaurus (Fluhr, 1996). Haddouti (1999) points out that the larger the vocabulary in the thesaurus, the less effective it becomes. Finally, a controlled vocabulary search can be difficult for a user who does not understand how the system or the thesaurus is constructed, and both the assignment of index terms and the construction of the thesaurus can be labor-intensive (Oard, 1997).

Dictionary-Based Approaches

This approach combines monolingual or bilingual dictionaries to produce something similar to a thesaurus (Oard and Dorr, 1996), which in turn serves as a platform for developing multilingual systems. Hull and Grefenstette (1996) used a bilingual dictionary for their CLIR experiment with French and English queries.

The dictionary-based approach typically suffers from ambiguity and from limited scope, since dictionaries often omit technical terminology (Haddouti, 1999). Hull and Grefenstette (1996) also noted that, for their research, they had to strip large amounts of information from the dictionary, such as parts of speech and pronunciation guides, that would have harmed the system's performance.
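The two problems named above, ambiguity and limited scope, can be seen in a minimal sketch of word-by-word query translation. The dictionary entries and the function name `translate_query` are hypothetical and chosen only for illustration; this is not the procedure used by Hull and Grefenstette (1996).

```python
# Toy French-to-English dictionary. The entries are illustrative only and are
# not taken from any real lexicon.
TOY_DICTIONARY = {
    "avocat": ["lawyer", "avocado"],
    "droit": ["law", "right", "straight"],
    "travail": ["work", "labor"],
}


def translate_query(query, dictionary):
    """Replace each query term with all of its dictionary translations.

    Terms missing from the dictionary are passed through unchanged and are
    also returned separately so the caller can see what was not covered.
    """
    translated, missing = [], []
    for term in query.lower().split():
        if term in dictionary:
            # Every sense is kept, so ambiguity carries over into the query.
            translated.extend(dictionary[term])
        else:
            translated.append(term)
            missing.append(term)
    return translated, missing


terms, missing = translate_query("avocat droit informatique", TOY_DICTIONARY)
print(terms)
print(missing)
```

In this sketch the translated query grows from three terms to six because every sense of an ambiguous word is kept, and the technical term "informatique" passes through untranslated because the toy dictionary does not cover it.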

Latent Semantic Indexing

Another way to approach CLIR is through latent semantic indexing (LSI), which makes comparisons between sets of semantically related words (Fluhr, 1996). In LSI the principal components are thought to represent important conceptual distinctions (Oard and Dorr, 1996, p.21). This allows retrieval to reflect the actual relations between words and concepts. Because this approach orders documents by how closely related they are semantically, and therefore clarifies which specific concept a term may represent, LSI can also help to limit or resolve ambiguity problems (Davis and Dunning, 1995). Landauer and Littman (1991) (as cited in Oard and Dorr, 1996) were among the first to work on CLIR using LSI. Their study illustrates a basic approach: passages from a training collection are evaluated and their principal components are identified to form clusters of concepts. Berry and Young (1995) had more success with LSI when they used more finely grained training data, such as the first paragraph of a passage from the Bible instead of the whole passage.
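A rough sketch of the idea follows, assuming NumPy is available. The dual-language training passages are invented toy data, the choice of k = 2 is arbitrary, and the names `to_vector`, `fold_in` and `cosine` are hypothetical. It follows the general shape of LSI (a truncated singular value decomposition of a term-by-passage matrix, with new texts folded into the reduced space) rather than the exact setup of any study cited above.

```python
import numpy as np

# Toy dual-language training passages (hypothetical data): each passage holds
# the English and French words for the same content, concatenated.
TRAINING_PASSAGES = [
    "dog cat chien chat",
    "dog chien",
    "cat chat",
    "bank money banque argent",
    "money argent",
]

vocabulary = sorted({w for p in TRAINING_PASSAGES for w in p.split()})
term_index = {w: i for i, w in enumerate(vocabulary)}


def to_vector(text):
    """Count-based term vector over the training vocabulary (unknown words ignored)."""
    vec = np.zeros(len(vocabulary))
    for w in text.split():
        if w in term_index:
            vec[term_index[w]] += 1.0
    return vec


# Term-by-passage matrix, one column per training passage.
A = np.column_stack([to_vector(p) for p in TRAINING_PASSAGES])

# Truncated SVD: keep the k largest singular values and their left singular vectors.
k = 2
U, s, _ = np.linalg.svd(A, full_matrices=False)
U_k, s_k = U[:, :k], s[:k]


def fold_in(text):
    """Project a monolingual text into the k-dimensional space."""
    return (U_k.T @ to_vector(text)) / s_k


def cosine(a, b):
    """Cosine similarity between two vectors in the reduced space."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


query = fold_in("dog")                 # English query
related = fold_in("chien chat")        # French text on the same topic
unrelated = fold_in("banque argent")   # French text on a different topic

print(round(cosine(query, related), 3))
print(round(cosine(query, unrelated), 3))
```

The English query shares no words with either French text, yet in this toy setting it should score close to 1 against the text on the same topic and close to 0 against the other, because the dual-language training passages place translation-equivalent terms along the same component.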

Corpora-Based Approaches

According to Haddouti (1999, p.4), the corpora-based technique analyzes large collections of existing texts and automatically extracts the information needed. This is done by exploiting statistical information about term usage within the corpus, combined with linguistic constraints to avoid errors (Haddouti, 1999). Oard and Dorr (1996, p.18) describe this approach as a type of automatic thesaurus building in which information about the relationships between terms is obtained from observed statistics of term usage. Lin and Chen (1996) have used this approach in their research on machine learning and multilingual thesaurus construction. However, this technique ideally requires that large collections of thousands of documents covering similar subjects be assembled, and such collections are scarce.
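One way to picture automatic thesaurus building from term-usage statistics is a co-occurrence count over a sentence-aligned corpus. The sketch below is an assumption-laden illustration, not the method of Haddouti (1999) or Lin and Chen (1996): the three aligned sentence pairs are invented, the Dice coefficient is just one possible association score, and the small stop-word list is a crude stand-in for linguistic constraints.

```python
from collections import Counter

# Hypothetical sentence-aligned corpus of (English, French) pairs.
ALIGNED_CORPUS = [
    ("the bank closes", "la banque ferme"),
    ("the bank opens", "la banque ouvre"),
    ("the river rises", "le fleuve monte"),
]

# Crude stand-in for linguistic constraints: ignore function words.
STOP_WORDS = {"the", "la", "le"}


def content_words(sentence):
    """Set of lower-cased words in a sentence, minus stop words."""
    return {w for w in sentence.lower().split() if w not in STOP_WORDS}


def build_thesaurus(corpus):
    """Rank target-language candidates for each source term by Dice score.

    Dice(s, t) = 2 * pairs containing both / (pairs with s + pairs with t).
    """
    src_count, tgt_count, pair_count = Counter(), Counter(), Counter()
    for src, tgt in corpus:
        src_words, tgt_words = content_words(src), content_words(tgt)
        src_count.update(src_words)
        tgt_count.update(tgt_words)
        pair_count.update((s, t) for s in src_words for t in tgt_words)

    thesaurus = {}
    for (s, t), both in pair_count.items():
        dice = 2 * both / (src_count[s] + tgt_count[t])
        thesaurus.setdefault(s, []).append((t, dice))
    for candidates in thesaurus.values():
        # Best score first; ties are broken alphabetically for stable output.
        candidates.sort(key=lambda item: (-item[1], item[0]))
    return thesaurus


thesaurus = build_thesaurus(ALIGNED_CORPUS)
print(thesaurus["bank"])
print(thesaurus["river"])
```

With this toy data "bank" is linked most strongly to "banque" because the pair co-occurs in two sentence pairs, whereas "river" cannot be separated from its sentence companions ("fleuve" and "monte" receive the same score) because it appears only once. That tie illustrates why the technique depends on large collections.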
